Reading a CSV with data type specification. Error: cannot cast array from ('O') to ('int32')

Hello! I want to convert a CSV file to JSON, by using python.

Because the file I want to convert is pretty big (over 300 MB), when I read the CSV, I have to specify the data type of each column.

To read the CSV, and then convert it to JSON, I use this code:

import pandas as pd

df_train = pd.read_csv(r'invoice_train.csv', dtype={"client_id": "string", "invoice_date": "string", "tarif_type": "string", "counter_number": "string", "counter_statue": int, "counter_code": "string", "reading_remarque": "string", "counter_coefficient": int, "consommation_level_1": int, "consommation_level_2": int, "consommation_level_3": int, "consommation_level_4": int, "old_index": int, "new_index": int, "months_number": int, "counter_type": "string"}) 

df_train.to_json(r'invoice_train.json')

But, I get an error, when I run this on Anaconda Prompt:

How should I read the CSV without getting this error?

Hello! I want to convert a CSV file to JSON, by using python.

Because the file I want to convert is pretty big (over 300 MB), when I read the CSV, I have to specify the data type of each column.

To read the CSV, and then convert it to JSON, I use this code:

import pandas as pd

df_train = pd.read_csv(r’invoice_train.csv’, dtype={“client_id”: “string”, “invoice_date”: “string”, “tarif_type”: “string”, “counter_number”: “string”, “counter_statue”: int, “counter_code”: “string”, “reading_remarque”: “string”, “counter_coefficient”: int, “consommation_level_1”: int, “consommation_level_2”: int, “consommation_level_3”: int, “consommation_level_4”: int, “old_index”: int, “new_index”: int, “months_number”: int, “counter_type”: “string”})

df_train.to_json(r’invoice_train.json’)

But, I get an error, when I run this on Anaconda Prompt:

Please just paste the error text into your message directly. Those of us
reading this as email have to jump through hoops to see your screenshot.

How should I read the CSV without getting this error?

The error at the bottom said “Invalid literal for int() with base 10:
‘A’”. That suggests that your data had an ‘A’ in a column where you
expected in int. Unfortunately it looks like pandas doesn’t provide any
context letting you find the source data.

I’d just use the csv module myself. It would require writing the whole
read-and-convert code yourself.

Totoally untested example:

from csv import DictReader
import json

csv_filename = r'invoice_train.csv'
json_filename = r'invoice_train.json'

# provide actual functions for each conversion
conversions = {"client_id": str, "invoice_date": str, "counter_statue": int}

# read the file row by row, convert, keep context so you know what
# went wrong _and where_ from the exception messages
context = filename
with open(filename) as csvf:
    reader = DictReader(csvf)
    all_rows = []
    for rownum, row in enumerate(reader, 1):
        row_context = f'{context}, row {rownum}'
        for column_name, conversion in conversions.items():
            column_context = f'{row_context}: {column_name}'
            try:
                value = row[column_name]
            except KeyError:
                raise KeyError(f"{column_context}: no such column") from e
            else:
                value_context = f'{column_context}: {value!r}'
                try:
                    row[column_name] = conversion(value)
                except ValueError as e:
                    raise ValueError(f"{value_context}: cannot convert using {conversion}: {e}") from e
        all_rows.append(row)

# save as JSON
with open(json_file, 'w') as jsonf:
    json.dump(all_rows, jsonf)

The pandas.read-csv function is very versatile, but it clearly produces
lousy exceptions when the data are not formatted as you expect.

Cheers,
Cameron Simpson cs@cskk.id.au

Hi Tudor,

you alternatively could parse the csv file line by line using regular expressions.

If you are used to the re module, this is very easy and done quickly too. Using regular expressions, you have more control in case of errors - you e.g. can ignore lines, which do not match the line regular expression and print them out afterwards with a belonging error message or alike.

Cheers, Dominik

Ok, I didn’t know that. Thank you for warning me. :slightly_smiling_face:

Yes, I opened the CSV file in Excel and tried to find this ‘A’ in every column with “Ctrl + F”, but I didn’t find anything. Although, the Excel doesn’t open the whole file, because the file exceeds the Excel capacity.

I ran this code with “conversions” variable adjusted:

conversions = {"client_id": "string", "invoice_date": "string", "tarif_type": "string", "counter_number": "string", "invoice_date": "string", "counter_status": int, "counter_code": "string", "reading_remarque": "string", "counter_coefficient": int, "consommation_level_1": int, "consommation_level_2": int, "consommation_level_3": int, "consommation_level_4": int, "old_index": int, "new_index": int, "months_number": int, "counter_type": "string"}

but I bump into the following error:
Traceback (most recent call last):
File “”, line 15, in
TypeError: ‘str’ object is not callable

Regarding this line of code, with “conversions” variable, I tried the str and ‘str’ options, insead of “string”, but this doesn’t change anything.

Also, in this error message, it’s saying that the error is in the line 15. Regarding this, when counting lines, do I have to take into account the comments and newlines?

Thanks for the reply!

Yes, I opened the CSV file in Excel and tried to find this ‘A’ in every column with “Ctrl + F”, but I didn’t find anything. Although, the Excel doesn’t open the whole file, because the file exceeds the Excel capacity.

Hahahaha! Oh Excel!

CSV files are text. You could just pull it up in a decent text editor
and look.

I ran this code with “conversions” variable adjusted:

conversions = {“client_id”: “string”, “invoice_date”: “string”, “tarif_type”: “string”, “counter_number”: “string”, “invoice_date”: “string”, “counter_status”: int, “counter_code”: “string”, “reading_remarque”: “string”, “counter_coefficient”: int, “consommation_level_1”: int, “consommation_level_2”: int, “consommation_level_3”: int, “consommation_level_4”: int, “old_index”: int, “new_index”: int, “months_number”: int, “counter_type”: “string”}

but I bump into the following error:
Traceback (most recent call last):
File “”, line 15, in
TypeError: ‘str’ object is not callable

Regarding this line of code, with “conversions” variable, I tried the str and ‘str’ options, insead of “string”, but this doesn’t change anything.

Ok. The conversions dict in my code was a mapping of column names to
conversion functions (callables). So “string”, being a str, is not a
callable thing. (You could add mode stuff to the code to recognise that
and use str(), which panda probably does, but for testing purposes let’s
not).

So a pair like:

"client_id": "string",

would want to be:

"client_id": str,

since types are callables.

Note that csv.reader has a default dialect, and may be preconverting
some column values. Eg source data like:

"foo bah", 5

might already be converted to a str and an int. But both str() and int()
accept their own type and change nothing, so this is safe.

Also, in this error message, it’s saying that the error is in the line 15. Regarding this, when counting lines, do I have to take into account the comments and newlines?

Yes you do.

Personally, I put these things into files eg “scan_my_csv.py” and run
the file. But if you’re in IDLE or something else with a good “edit my
code” window, no need.

Cheers,
Cameron Simpson cs@cskk.id.au