Newbie processing DOS legacy print file

New to Python,
Can someone please point out the error of my ways in this code:

import re
file_lines =[]
#input is DOS legacy print file 
with open("S285IBUD.PRN", 'r') as the_file: 
     for line in the_file:
         #check if new page 
         match = re.findall('0c', line)
         if match: cnt = 0
         #check if new line
         match = re.findall('0d0a', line)
         if match: cnt+1
         #start processing from 7th line, where line length is > 4
         if cnt > 6 and len(line) > 4:
         # process the line in some way  
            #file_lines.append(line) 
            print("line no.: ",  cnt,  line)

in VS Code terminal I get error (first of many)

Traceback (most recent call last):
File “c:\python-v3-12.2\prt-2-csv\prt2csv.py”, line 4, in
for line in the_file:
^^^^^^^^^^^^^^^^^^^^^

Any help much appreciated

This isn’t enough context - your code looks superficially ok to my eye.

Please paste the full error output from VSCode. In particular the
exception, but the entire thing provides useful context.

Apart from what Cameron has said, I suspect that the file is binary data, not text.

My crystal ball says that your file is actually OEM encoded in codepage 437. But Python knows more about what’s going on than my crystal ball does, so start by reading the ENTIRE error message, and if you don’t understand it, post the ENTIRE error message here.

Inside a code block, like you (correctly, thanks) did for your code

I can note a couple problems that pop out already, irrespective of what your actual specific error message is:

You should always specify an encoding when reading a file as text, or your file is liable to end up mangled or raise an error on reading. This is the most probable cause of the specific traceback you’re seeing.

While not wrong per se, stuffing the if and its body into one line like this is generally considered bad style and makes the logic of your program harder to follow.

The cnt+1 in your second if statement does nothing: it is an expression that evaluates to one greater than cnt, but the result is then discarded rather than assigned to a variable. You probably mean cnt += 1, which adds 1 to the value of cnt.
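A minimal sketch of the difference between the bare expression and the augmented assignment. The variable names after_expression and after_assignment are only for illustration:

```python
cnt = 0

cnt + 1                 # evaluated, but the result is thrown away
after_expression = cnt  # cnt is still 0 here

cnt += 1                # rebinds cnt to cnt + 1
after_assignment = cnt  # cnt is now 1

print(after_expression, after_assignment)  # 0 1
```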

Here’s the entire error:
PS C:\python-v3-12.2\prt-2-csv> & c:/python-v3-12.2/prt-2-csv/.venv/Scripts/Activate.ps1
(.venv) PS C:\python-v3-12.2\prt-2-csv> & c:/python-v3-12.2/prt-2-csv/.v
Traceback (most recent call last):
File “c:\python-v3-12.2\prt-2-csv\prt2csv.py”, line 4, in
for line in the_file:
^^^^^^^^^^^^^^^^^^^^^
File “C:\Users\david\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1255.py”, line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: ‘charmap’ codec can’t decode byte 0x8c in position 10: character maps to

In Notepad++ the encoding for the input file is Hebrew OEM 862
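Assuming Notepad++ is right about Hebrew OEM 862, the matching Python codec name would be cp862. The sketch below is only an illustration, not a tested fix for this particular file: it decodes the byte from the traceback (0x8c) with both codecs, and the helper read_prn_text is a hypothetical name. Note that ISO-8859-8, which is tried later in the thread, is a different Hebrew encoding from OEM 862, so it may decode without an error and still give the wrong characters.

```python
# 0x8c is the byte named in the traceback; 0x80 is added as a second sample byte.
raw = bytes([0x8C, 0x80])

try:
    raw.decode("cp1255")  # the codec named in the traceback
except UnicodeDecodeError as exc:
    print("cp1255 fails:", exc.reason)

# cp862 (DOS Hebrew) maps 0x80-0x9a to the Hebrew letters.
text = raw.decode("cp862")
print([hex(ord(ch)) for ch in text])  # ['0x5dc', '0x5d0']


def read_prn_text(path, encoding="cp862"):
    """Hypothetical helper: read the whole print file as text with an explicit encoding."""
    with open(path, encoding=encoding) as the_file:
        return the_file.read()
```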

OK changed encoding, the script now is:

import re
file_lines =[]
cnt = 0
#input is DOS legacy print file 
with open("S285IBUD.PRN",  encoding='ISO-8859-8') as the_file: 
     for line in the_file:
         #check if new page 
         match = re.findall('0c', line)
         if match: cnt = 0
         #check if new line
         match = re.findall('0d0a', line)
         if match: cnt+=1
         #start processing from 7th line, where line length is > 4
         if cnt > 6 and len(line) > 4:
         # process the line in some way  
            #file_lines.append(line) 
            print("line no.: ",  cnt,  line)
            

and the error is:

File “c:\python-v3-12.2\prt-2-csv\prt2csv.py”, line 13, in
^^^
NameError: name ‘cnt’ is not defined. Did you mean: ‘int’?
(.venv) PS C:\python-v3-12.2\prt-2-csv> & c:/python-v3-12.2/prt-2-csv/.venv/Scripts/python.exe c:/python-v3-12.2/prt-2-csv/prt2csv.py
(.venv) PS C:\python-v3-12.2\prt-2-csv> & c:/python-v3-12.2/prt-2-csv/.venv/Scripts/python.exe c:/python-v3-12.2/prt-2-csv/prt2csv.py
(.venv) PS C:\python-v3-12.2\prt-2-csv> & c:/python-v3-12.2/prt-2-csv/.venv/Scripts/python.exe c:/python-v3-12.2/prt-2-csv/prt2csv.py
(.venv) PS C:\python-v3-12.2\prt-2-csv> & c:/python-v3-12.2/prt-2-csv/.venv/Scripts/python.exe c:/python-v3-12.2/prt-2-csv/prt2csv.py
(.venv) PS C:\python-v3-12.2\prt-2-csv> & c:/python-v3-12.2/prt-2-csv/.venv/Scripts/python.exe c:/python-v3-12.2/prt-2-csv/prt2csv.py
(.venv) PS C:\python-v3-12.2\prt-2-csv>

Thanks for your help

The new page character isn’t '0c' because that’s the 2 characters '0' and 'c'. What you want to find is '\x0c' (or '\f').

Also, the line ending isn’t '0d0a' but '\x0d\x0a' (or '\r\n').

Interestingly, Python accepts '\f' but prints it as '\x0c', unlike '\x09' (or '\t'), '\x0a' (or '\n') and '\x0d' (or '\r') which are always printed as '\t', '\n' and '\r' respectively.
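A small sketch of the point above, using a made-up sample line (the header text is hypothetical):

```python
# Made-up line: a form feed, some header text, then CR LF.
line = "\x0cREPORT HEADER\r\n"

print("0c" in line)      # False: searches for the two characters '0' and 'c'
print("\x0c" in line)    # True: searches for the single form feed character
print("\f" == "\x0c")    # True: two spellings of the same character
print(len("0d0a"), len("\r\n"))  # 4 2
print(repr("\f"), repr("\t"))    # shows how each one is printed back
```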

I noticed that, but since the file they are parsing contains page data meant to be sent to a printer, I assumed those sequences were data encoded in that particular legacy format, not literal line breaks in the PRN file (particularly as the code's logic seems to treat lines in the file separately from raw newline characters in the data). So it doesn’t seem to have anything to do with the problem here.

It seems everything works successfully after the first time. Given that I cannot reproduce the NameError (it’s rather straightforward to examine your code statically and see that cnt must always be defined, at least within the given block), I can only conclude that the code you ran that time does not match what you have above.

The cnt name error could happen if the first match fails (i.e. cnt=0
is not run) but the second match succeeds (running cnt=cnt+1 or
similar, which fails if cnt has not first been initialised).

That was my first thought too, but in fact cnt is initialized to 0 on the third line of the script, outside the whole block in question, so the only way there could be a NameError on the indicated line with the given code is if some later code we aren’t shown deletes the variable with del (which seems very unlikely).

Glad you were able to sort out the problem. Some useful hints and general information for future reference - both for your own debugging, and for using the forum more effectively:

This is not “the first of many errors”. It is the beginning of the stack trace for a single error. Its purpose is to show you how the code execution got to the point where an error occurred. The error is at the bottom. It will look something like FooError: something bad happened. That is, there will be some name of a type of error (these normally use titlecase and have Error at the end), and then a colon, and then a message to say what the problem is.
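As a rough illustration of "the error is at the bottom", the sketch below triggers the same kind of decoding error on purpose and picks out the first and last lines of the formatted traceback. The function name decode_with_wrong_codec is made up for this example:

```python
import traceback


def decode_with_wrong_codec():
    # 0x8c has no mapping in cp1255, so this raises UnicodeDecodeError.
    return b"\x8c".decode("cp1255")


try:
    decode_with_wrong_codec()
except UnicodeDecodeError:
    lines = traceback.format_exc().splitlines()

print(lines[0])   # the start of the stack trace, not an error in itself
print(lines[-1])  # the actual error: its type, a colon, then the message
```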

This is the entire error. It gets repeated multiple times in your terminal, because you (or the IDE) tried to run the code multiple times.

I am guessing at the last bit, <undefined>, because it didn’t appear in your post. The problem is that this forum thinks that’s some invalid HTML, and strips it out. I also guessed to fix the whitespace at the beginning of lines, and to fix some quotes that got replaced with “smart” quotes.

When you post an error message, please format it the same way as the code. Python’s stack traces are designed to be viewed in a terminal (I removed a couple of lines that are not part of the error, but instead showing a Powershell command prompt and some commands that were provided there). Using code formatting makes sure everything lines up properly and that nothing gets lost.

Anyway, yes. In order to open a file properly - in any programming language - you must understand the following:

  • Are the contents intended to be interpreted as text?

  • If they are, what encoding is used to explain how the bytes represent text?

  • If not, what is the file format (i.e.: how do you know the meaning of each byte)?

There are no shortcuts here.
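One way to start answering those questions is to look at the raw bytes before deciding anything. This is only a sketch: the sample bytes are invented, and describe_bytes is a hypothetical helper, not part of any library:

```python
def describe_bytes(raw: bytes, limit: int = 16) -> str:
    """Return the first `limit` bytes as space-separated hex pairs."""
    return raw[:limit].hex(" ")


# Invented sample: form feed, two high bytes, some ASCII, then CR LF.
sample = b"\x0c\x8c\x8c 12\r\n"

hex_view = describe_bytes(sample)
print(hex_view)              # 0c 8c 8c 20 31 32 0d 0a
print(b"\x0c" in sample)     # True: contains a form feed byte
print(b"\r\n" in sample)     # True: contains a CR LF pair
```

For a real file, the bytes would come from open(path, "rb").read() instead of a literal.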


Hi Karl,
Thanks for the tips and info, much appreciated.
One more question, the last line is

print("line no.: ",  cnt,  line)

Where/how can I see the output from this?

Thanks for your help

OK got it, it never finds new page or new line. Need to work on that.
How would I detect hex “0C” (1 byte) or hex “0D0A” (2 bytes)?

Thanks for your help

Just FYI, as you could see from the preview and the resulting post, your code was not displayed because the triple backticks must be on separate lines from the code (I fixed it for you this time).

Oh, and to answer your question: you would generally enter them as literals via escape sequences. As mentioned in the linked docs, you can enter byte values in hex via the \x prefix, e.g. \x0c or \x0d\x0a, and each of those particular values also has its own short escape code: \f for FF and \r\n for CR LF.

Do note, though, that if you’re reading the file as text rather than as bytes, newline sequences are automatically converted to the standard \n (LF) in the string on input, and converted back to the platform default on write, unless you specify otherwise via the newline parameter to open(). Therefore \r\n in the source (as well as a lone \n or \r) will all be normalized to \n in the Python string, and so a search for \r\n will not find anything.

You can either search for \n instead, read in the file in binary rather than text mode, or pass either newline='' (which leaves newline characters untouched but still treats \n and others as line terminators), or newline='\r\n' (which also leaves newlines untouched, but tells Python explicitly to treat \r\n and only \r\n as the line terminator).
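A short sketch comparing three of those options on a throwaway file. The file contents are made up, and cp862 is assumed as the encoding based on what Notepad++ reported earlier in the thread:

```python
import os
import tempfile

# Made-up sample: two CR LF terminated lines, written as raw bytes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".prn") as tmp:
    tmp.write(b"first\r\nsecond\r\n")
    path = tmp.name

try:
    with open(path, encoding="cp862") as f:              # default newline handling
        translated = f.readlines()
    with open(path, encoding="cp862", newline="") as f:  # no translation
        untouched = f.readlines()
    with open(path, "rb") as f:                          # binary mode
        raw_lines = f.readlines()
finally:
    os.remove(path)

print(translated)  # ['first\n', 'second\n']
print(untouched)   # ['first\r\n', 'second\r\n']
print(raw_lines)   # [b'first\r\n', b'second\r\n']
```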

Hi uploaded image of part of first page from notepad++ - encoding: Hebrew OEM-862 and an image of first few lines from hex editor - 2nd post

I want to convert this file to a csv file to import into Excel.
What would be the best way to read this file in Python3 so that I can process it and output a CSV file.

Any help much appreciated

Well, like others above, I would say you are reading a binary file in text mode. Doing this may not produce any error right now, but you will get strange results in the rest of your program.

First, tell open to use binary mode:

with open("FILE.PRN", "rb") as the_file:
    data = the_file.read()  # data is a bytes object, not a str

The key thing here is the second parameter, which has “b” in it.

Second: be prepared that from now on everything you read from the file will be a bytes object, not a str. For the most part you can treat them the same way, but some functions will refuse to operate on a bytes object.
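A minimal sketch of working with bytes along the lines of the original script: split pages on the form feed byte, split lines on CR LF, skip the first six lines of each page and keep lines longer than four characters. Everything here is an assumption about the file layout: the function name iter_report_lines, the cp862 encoding, and the idea that a form feed separates pages (a file that starts with a form feed would produce an empty first page).

```python
from typing import Iterator

FORM_FEED = b"\x0c"
CRLF = b"\r\n"


def iter_report_lines(
    raw: bytes,
    encoding: str = "cp862",   # assumed from the Notepad++ report
    header_lines: int = 6,     # lines to skip at the top of each page
    min_length: int = 4,       # keep only lines longer than this
) -> Iterator[tuple[int, int, str]]:
    """Yield (page number, line number, decoded text) for the data lines."""
    for page_no, page in enumerate(raw.split(FORM_FEED), start=1):
        for line_no, line in enumerate(page.split(CRLF), start=1):
            text = line.decode(encoding)
            if line_no > header_lines and len(text) > min_length:
                yield page_no, line_no, text


# Made-up data: six header lines and a data line on each of two pages.
header = b"h1\r\nh2\r\nh3\r\nh4\r\nh5\r\nh6\r\n"
sample = header + b"data line 1\r\nabc\r\n" + FORM_FEED + header + b"\x8c" * 5 + CRLF

rows = list(iter_report_lines(sample))
for row in rows:
    print(row)
```

The rows produced this way are plain tuples, so they could later be handed to csv.writer once the column layout of the report is known.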

Third, and last but not least: take a look at the “struct” module (“struct: Interpret bytes as packed binary data” in the Python 3.12.2 documentation).
It will definitely be your best friend in this task.