Reading a CSV with data type specification. Error: cannot cast array from ('O') to ('int32')

Hello! I want to convert a CSV file to JSON using Python.

Because the file I want to convert is pretty big (over 300 MB), I have to specify the data type of each column when I read the CSV.

To read the CSV, and then convert it to JSON, I use this code:

import pandas as pd

df_train = pd.read_csv(r'invoice_train.csv', dtype={"client_id": "string", "invoice_date": "string", "tarif_type": "string", "counter_number": "string", "counter_statue": int, "counter_code": "string", "reading_remarque": "string", "counter_coefficient": int, "consommation_level_1": int, "consommation_level_2": int, "consommation_level_3": int, "consommation_level_4": int, "old_index": int, "new_index": int, "months_number": int, "counter_type": "string"}) 

df_train.to_json(r'invoice_train.json')

But I get an error when I run this on Anaconda Prompt:

How should I read the CSV without getting this error?


But I get an error when I run this on Anaconda Prompt:

Please just paste the error text into your message directly. Those of us
reading this as email have to jump through hoops to see your screenshot.

How should I read the CSV without getting this error?

The error at the bottom said “Invalid literal for int() with base 10:
‘A’”. That suggests that your data had an ‘A’ in a column where you
expected an int. Unfortunately it looks like pandas doesn’t provide any
context letting you find the source data.
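As an aside, pandas itself can be coaxed into locating the bad values: read the suspect column as text first, then coerce it with pd.to_numeric(errors='coerce') and look for the NaNs. A small sketch with made-up stand-in data (the column name comes from your dtype dict; the values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for invoice_train.csv: one non-numeric value.
df = pd.DataFrame({"counter_statue": ["0", "1", "A", "2"]})

# Coerce to numbers; entries that cannot be parsed become NaN.
as_numbers = pd.to_numeric(df["counter_statue"], errors="coerce")

# The NaN positions point at the offending source rows.
bad_rows = df[as_numbers.isna()]
print(bad_rows)
```

With the real file you would first read the suspect columns with dtype="string" so the read itself cannot fail, and only then coerce and inspect.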

I’d just use the csv module myself. It would require writing the whole
read-and-convert code yourself.

Totally untested example:

from csv import DictReader
import json

csv_filename = r'invoice_train.csv'
json_filename = r'invoice_train.json'

# provide actual functions for each conversion
conversions = {"client_id": str, "invoice_date": str, "counter_statue": int}

# read the file row by row, convert, keep context so you know what
# went wrong _and where_ from the exception messages
context = csv_filename
with open(csv_filename) as csvf:
    reader = DictReader(csvf)
    all_rows = []
    for rownum, row in enumerate(reader, 1):
        row_context = f'{context}, row {rownum}'
        for column_name, conversion in conversions.items():
            column_context = f'{row_context}: {column_name}'
            try:
                value = row[column_name]
            except KeyError as e:
                raise KeyError(f"{column_context}: no such column") from e
            else:
                value_context = f'{column_context}: {value!r}'
                try:
                    row[column_name] = conversion(value)
                except ValueError as e:
                    raise ValueError(f"{value_context}: cannot convert using {conversion}: {e}") from e
        all_rows.append(row)

# save as JSON
with open(json_filename, 'w') as jsonf:
    json.dump(all_rows, jsonf)

The pandas.read_csv function is very versatile, but it clearly produces
lousy exceptions when the data are not formatted as you expect.

Cheers,
Cameron Simpson cs@cskk.id.au

Hi Tudor,

Alternatively, you could parse the CSV file line by line using regular expressions.

If you are used to the re module, this is easy and quickly done. Regular expressions also give you more control in case of errors: you can, for example, skip lines that do not match the expected pattern and print them out afterwards with a suitable error message.
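A totally untested sketch of that idea, with a hypothetical line pattern that insists the fifth field is an integer (the real pattern would depend on your column layout):

```python
import re

# Hypothetical pattern: four arbitrary comma-free fields, then a field
# that must be all digits, then the rest of the line.
line_re = re.compile(r"^([^,]*,){4}\d+(,[^,]*)*$")

# Stand-in lines; the second has a letter where an int is expected.
lines = ["c1,2020,t1,n1,0,rest", "c2,2020,t1,n2,A,rest"]
for lineno, line in enumerate(lines, 1):
    if not line_re.match(line):
        print(f"line {lineno} does not match: {line}")
```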

Cheers, Dominik

Ok, I didn’t know that. Thank you for warning me. :slightly_smiling_face:

Yes, I opened the CSV file in Excel and tried to find this ‘A’ in every column with “Ctrl + F”, but I didn’t find anything. However, Excel doesn’t open the whole file, because the file exceeds Excel’s row limit.

I ran this code with “conversions” variable adjusted:

conversions = {"client_id": "string", "invoice_date": "string", "tarif_type": "string", "counter_number": "string", "invoice_date": "string", "counter_status": int, "counter_code": "string", "reading_remarque": "string", "counter_coefficient": int, "consommation_level_1": int, "consommation_level_2": int, "consommation_level_3": int, "consommation_level_4": int, "old_index": int, "new_index": int, "months_number": int, "counter_type": "string"}

but I bump into the following error:
Traceback (most recent call last):
File “”, line 15, in
TypeError: ‘str’ object is not callable

Regarding this line of code, with “conversions” variable, I tried the str and ‘str’ options, instead of “string”, but this doesn’t change anything.

Also, this error message says that the error is at line 15. Regarding this, when counting lines, do I have to take into account the comments and blank lines?

Thanks for the reply!

Yes, I opened the CSV file in Excel and tried to find this ‘A’ in every column with “Ctrl + F”, but I didn’t find anything. However, Excel doesn’t open the whole file, because the file exceeds Excel’s row limit.

Hahahaha! Oh Excel!

CSV files are text. You could just pull it up in a decent text editor
and look.

I ran this code with “conversions” variable adjusted:

conversions = {"client_id": "string", "invoice_date": "string", "tarif_type": "string", "counter_number": "string", "invoice_date": "string", "counter_status": int, "counter_code": "string", "reading_remarque": "string", "counter_coefficient": int, "consommation_level_1": int, "consommation_level_2": int, "consommation_level_3": int, "consommation_level_4": int, "old_index": int, "new_index": int, "months_number": int, "counter_type": "string"}

but I bump into the following error:
Traceback (most recent call last):
File “”, line 15, in
TypeError: ‘str’ object is not callable

Regarding this line of code, with “conversions” variable, I tried the str and ‘str’ options, instead of “string”, but this doesn’t change anything.

Ok. The conversions dict in my code was a mapping of column names to
conversion functions (callables). So “string”, being a str, is not a
callable thing. (You could add more stuff to the code to recognise that
and use str(), which pandas probably does, but for testing purposes let’s
not).

So a pair like:

"client_id": "string",

would want to be:

"client_id": str,

since types are callables.
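For instance (a toy row, just to show the mechanism):

```python
# Types are callables: calling int or str performs the conversion.
conversions = {"client_id": str, "counter_statue": int}

row = {"client_id": "123", "counter_statue": "7"}
for column, convert in conversions.items():
    row[column] = convert(row[column])

print(row)  # {'client_id': '123', 'counter_statue': 7}
```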

Note that csv.reader has a default dialect; with the default dialect
every field arrives as a str, but other dialects (eg with
quoting=csv.QUOTE_NONNUMERIC) may be preconverting some column values. Eg
source data like:

"foo bah", 5

might already be converted to a str and an int. But both str() and int()
accept their own type and change nothing, so this is safe.

Also, this error message says that the error is at line 15. Regarding this, when counting lines, do I have to take into account the comments and blank lines?

Yes you do.

Personally, I put these things into files eg “scan_my_csv.py” and run
the file. But if you’re in IDLE or something else with a good “edit my
code” window, no need.

Cheers,
Cameron Simpson cs@cskk.id.au

Hello! As @cameron suggested, I treated the CSV file as a text file, not as an Excel file, as I originally did. So, I downloaded the CSV file again, but this time, I opened it with a text editor. I realised that the file size hadn’t changed. And this is good news, because now I have at hand the whole file to find that “A” that triggers the error.

Initially, the search term I used was “A”, so I just used “Ctrl-F”. But, because there is a lot of string data that contains “A”, and there are over 4 million rows, this tactic was useless.

Then, I used regular expressions to find that “A”, but this time I included a pattern: “,A”. The comma before the “A” helped to find the relevant “A”s.

So, this is the code:

#Identifying the ",A"s:
file1 = open(r"invoice_train.csv", "r").read()
import re
for line in re.findall(",A", file1):
	print(line)

And this is the output:

,A
,A
,A
,A
,A
,A
,A
,A
,A
,A
,A
,A
,A

So, I found these “A”s. But I don’t know how to remove the rows that contain them. How should I do this?
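A sketch of one way to do it: keep only the lines that do not contain ",A". Here a small list of made-up lines stands in for the real file; with the actual data you would iterate over open("invoice_train.csv") and write the survivors to a new file.

```python
# Stand-in for the file's lines; with the real file, iterate over
# open("invoice_train.csv") and write survivors to a new file.
lines = [
    "client_id,counter_statue\n",
    "c1,0\n",
    "c2,A\n",   # the kind of row that broke the int() conversion
    "c3,1\n",
]

clean = [line for line in lines if ",A" not in line]
print("".join(clean))
```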

Hello! As @cameron suggested, I treated the CSV file as a text file, not
as an Excel file, as I originally did. So, I downloaded the CSV file
again, but this time, I opened it with a text editor. I realised that
the file size hadn’t changed. And this is good news, because now I
have at hand the whole file to find that “A” that triggers the error.

Initially, the search term I used was “A”, so I just used “Ctrl-F”. But, because there is a lot of string data that contains “A”, and there are over 4 million rows, this tactic was useless.

But you could search for “,A” in the text editor, since you can see
there’s only about 10 from your code below.

Then, I used regular expressions to find that “A”, but this time I included a pattern: “,A”. The comma before the “A” helped to find the relevant “A”s.

So, this is the code:

# Identifying the ",A"s:
file1 = open(r"invoice_train.csv", "r").read()

This consumes a lot of memory because it reads the whole file in one
go. It’s better to read the file line by line, since it is lines of
text and you’re interested in them as lines.

import re

Try to put imports at the top of your script ahead of everything else.

for line in re.findall(",A", file1):
    print(line)

This scans the entire file text as a single string.

However, you’re interested in lines and line numbers. So let’s read the
file line by line:

with open('invoice_train.csv') as f:
    for lineno, line in enumerate(f, 1):
        if ",A" in line:
            print("line", lineno, line.strip())

Things to notice:

We open the file using the “with open() as f:” idiom. This ensures that
the file gets closed as soon as the programme gets out of the “with”
clause even if there’s an exception.

Text files are iterable, yielding lines. So you can read all the lines,
one at a time, like this:

for line in f:

Python’s builtin enumerate() function takes an iterable (for us, the
file, so lines of text) and yields (i, value) for each value. So this:

    for lineno, line in enumerate(f, 1):

yields (1,“first line\n”), (2,“second line\n”) etc etc, which gets you
line numbers for free.
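A quick demonstration:

```python
# enumerate pairs each item with a counter starting at the given value.
lines = ["first line\n", "second line\n"]
print(list(enumerate(lines, 1)))
# [(1, 'first line\n'), (2, 'second line\n')]
```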

The re module is overkill for looking for a fixed string, and slower.
You can just ask Python if a string has a substring like this:

if ",A" in line:

Then we just print the line number and line. The line iteration
includes the trailing newline, so we strip that off for the print().

Regular expressions should always be your second choice (well, I really
mean: not your first choice) for simple stuff like this. There are
things for which they’re a good match, but they’re cryptic and error
prone (I don’t mean unreliable, I mean hard to get correct for anything
nontrivial), and thus to be avoided unless they’re a superior choice in
other ways.

Cheers,
Cameron Simpson cs@cskk.id.au

Hi Tudor,

I normally use the re.match(…) method for that, e.g. alike the following:

import re
patternRe = re.compile(r"^.*[,]R.*$")
with open("invoice_train.csv", "r") as fileHandle:
    for currLineS in fileHandle:
        if patternRe.match( currLineS ) == None:
            print(currLineS.rstrip())

This just prints the lines without ",R"s.

If you want to get used to regular expressions, also have a look at the topic “named groups” and the groupdict() method - both come in quite handy for more advanced parsing tasks!
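For example, here is a hypothetical pattern for a simplified four-field invoice line (the field names are borrowed from the CSV; the layout is made up):

```python
import re

# Named groups label each field; groupdict() returns them as a dict.
line_re = re.compile(
    r"^(?P<client_id>[^,]+),(?P<invoice_date>[^,]+),"
    r"(?P<tarif_type>[^,]+),(?P<counter_statue>\d+)$"
)

m = line_re.match("c1,2020-01-01,t1,0")
print(m.groupdict())
# {'client_id': 'c1', 'invoice_date': '2020-01-01', 'tarif_type': 't1', 'counter_statue': '0'}
```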

Cheers, Dominik

And this is just part of why I’m against using regular expressions when
simpler and clearer methods will do. Look at this str-containing-a-regexp:

r"^.*[,]R.*$"

and the corresponding code test:

patternRe.match( currLineS ) == None

(BTW, you want “is”, and not “==”, for a None test.)

Without a regexp the if-statement looks like this:

if ",R" not in currLineS:

Cheers,
Cameron Simpson cs@cskk.id.au


With “in” your options are quite limited. If, for example, you also wanted to find all “,*R”, where * is an arbitrary sequence of spaces, the “in” approach does not work anymore. The approach using regular expressions is more general and powerful.
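For example, a pattern like r",\s*R" covers the variable spacing, where a plain substring test cannot:

```python
import re

# A comma, then any run of whitespace, then R.
pattern = re.compile(r",\s*R")

for line in ["a,R,b", "a,   R,b", "a,X,b"]:
    print(bool(pattern.search(line)))
# True, True, False
```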

Understanding as well as creating regular expressions is just a question of practice. It is a good idea to start practicing early.

Cheers, Dominik

Incredibly simple solution :slightly_smiling_face:.

I searched for “,A” in the text editor, found the rows with the "A"s and deleted them.

Then, I ran the code for reading the CSV and converting it to JSON without errors.

Thanks guys for all your advice!
