Hello! As @cameron suggested, I treated the CSV file as a text file, not
as an Excel file, as I originally did. So I downloaded the CSV file
again, but this time I opened it with a text editor. I realised that
the file size hadn’t changed. And this is good news, because now I
have the whole file at hand to find that “A” that triggers the error.
Initially, the search term I used was “A”, so I just used “CTRL-F”. But because a lot of the string data contains “A”, and there are over 4 million rows, this tactic was useless.
But you could search for “,A” in the text editor, since you can see
from your code below that there are only about 10 of them.
Then I used regular expressions to find that “A”, but this time I included a pattern: “,A”. The comma before the “A” helped to find these “A”s.
So, this is the code:
# Identifying the ",A"s:
file1 = open(r"invoice_train.csv", "r").read()
This consumes a lot of memory because it reads the whole file in one
go. It’s better to read the file line by line, since it consists of
lines of text and you’re interested in them as lines.
import re
Try to put imports at the top of your script ahead of everything else.
for line in re.findall(",A", file1):
    print(line)
This scans the entire file text as a single string.
However, you’re interested in lines and line numbers. So let’s read the
file line by line:
with open('invoice_train.csv') as f:
    for lineno, line in enumerate(f, 1):
        if ",A" in line:
            print("line", lineno, line.strip())
Things to notice:
We open the file using the “with open() as f:” idiom. This ensures that
the file gets closed as soon as the programme gets out of the “with”
clause, even if there’s an exception.
Text files are iterable, yielding lines. So you can read all the lines,
one at a time, like this:
for line in f:
Python’s builtin enumerate() function takes an iterable (for us, the
file, so lines of text) and yields (i, value) for each value. So this:
for lineno, line in enumerate(f, 1):
yields (1, "first line\n"), (2, "second line\n") and so on, which gets you
line numbers for free.
The re module is overkill for looking for a fixed string, and slower.
You can just ask Python if a string has a substring like this:
if ",A" in line:
Then we just print the line number and line. The line iteration
includes the trailing newline, so we strip that off for the print().
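If you want to try the pieces above without touching the 4 million row file, here is a minimal sketch of the same loop wrapped in a function. The function name find_lines and the sample rows are made up for illustration; io.StringIO just stands in for the open file, since it is also iterable line by line.

```python
import io


def find_lines(f, needle=",A"):
    """Return (lineno, line) pairs for every line of f containing needle.

    find_lines is a hypothetical helper name, not something from your script.
    """
    matches = []
    # enumerate(f, 1) numbers the lines starting from 1.
    for lineno, line in enumerate(f, 1):
        # Plain substring test, no regular expression needed.
        if needle in line:
            # Drop the trailing newline that line iteration keeps.
            matches.append((lineno, line.rstrip("\n")))
    return matches


# Made-up rows standing in for invoice_train.csv.
sample = io.StringIO("id,region,amount\n1,North,10\n2,A,20\n3,South,30\n")
for lineno, line in find_lines(sample):
    print("line", lineno, line)
```

With the real file you would pass the object from “with open('invoice_train.csv') as f:” instead of the StringIO.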
Regular expressions should always be your second choice (well, I really
mean: not your first choice) for simple stuff like this. There are
things for which they’re a good match, but they’re cryptic and error
prone (I don’t mean unreliable, I mean hard to get correct for anything
nontrivial), and thus to be avoided unless they’re a superior choice in
other ways.
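As one example of where a regular expression might earn its keep: if the plain “,A” test also catches values like “,Apple”, a pattern can insist that the “A” is the whole field. This is only a sketch, and it assumes (I don’t know your data) that the bad value is a field that is exactly “A” and that no fields contain quoted commas. The name whole_field_a and the sample lines are made up.

```python
import re

# Assumption: the offending value is a field that is exactly "A",
# i.e. ",A" followed by another comma or by the end of the line.
# "$" also matches just before a trailing newline, so lines read
# straight from a file work without stripping.
whole_field_a = re.compile(r",A(?:,|$)")

# Made-up sample lines for illustration.
lines = ["1,Apple,10", "2,A,20", "3,x,A"]

for lineno, line in enumerate(lines, 1):
    if whole_field_a.search(line):
        print("line", lineno, line)
```

Note that the simple test “",A" in line” would also report the “Apple” line here, which is the sort of case where the extra complexity of a pattern may be worth it.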
Cheers,
Cameron Simpson cs@cskk.id.au