Pandas ValueError, num columns in header and row does not match

c-rob · July 2, 2024, 11:17am

I’m using Python 3.11 and Pandas v2.2.2 on Windows 10.

When I assign my data to a dataframe like this

    '''datadict = a dict of all my data with keys.
    options = a class holding different data to pass between functions.
    options.headersxl = ['Hdr1', 'hdr2', 'hdr3', 'hdr4']
    '''
    import pandas
    import xlsxwriter
    procname = str(inspect.stack()[0][3]) + ":" # Returns function name.

    fn = os.path.join(options.progdir, options.datafn )
    tfn = fn + ".xlsx"
    if os.path.exists(tfn): 
        fn + "_" + getdatetime()
    fn + ".xlsx"
    outdata = []
    sheetname = 'Sheet1'
    # Now append rest of data row-by-row.
    for key, val in datadict.items(): 
        # tlist = [key.split(',') for key in datadict] # Split one row
        t = datadict[key]
        tlist = t.split(DATASEP) # Convert one row to a list. 
        outdata.append(tlist) # Save one row
        
    try: 
        df=pandas.DataFrame(outdata, columns=options.headersxl) # Make a dataframe from the list of lists. 
    except ValueError as e: 
        sys.exit(f"{procname} ERROR-dataframe1: Num headers does not match num columns.")
    except Exception as e: 
        sys.exit(f"{procname} ERROR-dataframe2: {e}")

I get an error “ValueError: 15 columns passed, passed data had 16 columns”.

How do I avoid this error? The data in every row may not have the same number of columns that the headers do.

Thank you.

dg-pb · July 2, 2024, 12:02pm

I suggest you use an IDE with debugger.

PyCharm Community Edition can be a good option.

It will let you walk the code line by line and inspect existing objects visually at any time.

You will be able to see attributes and dimensions for each object as you step through it.

onePythonUser · July 2, 2024, 12:36pm

Hi,

first let’s recognize the reason for the error. In the following example, if we attempt to pass in data that is greater than the number of columns defined, we will get the same error as you have described (albeit with different values).

# There are 4 column titles defined
col_titles = ['col1', 'col2', 'col3', 'col4']

data = [['data1', 'data2', 'data3', 'data4'],        # four columns
        ["blue", 'red', 'green', 'orange','silver']] # five columns

df = pd.DataFrame(data, columns = col_titles)

print('\n', df)

If we define the same script, but this time with matching columns, we will no longer receive the error:

# There are 4 column titles defined
col_titles = ['col1', 'col2', 'col3', 'col4']

data = [['data1', 'data2', 'data3', 'data4'], # four columns
        ["blue", 'red', 'green', 'orange']]   # four columns

df = pd.DataFrame(data, columns = col_titles)

print('\n', df)

From your last post regarding the same script, it goes back to how you are setting up your data arrays. Make sure that you understand how to set up your data prior to sending it to panda for processing. The number of data elements has to match the number of columns that have been defined.

c-rob · July 2, 2024, 1:07pm

We have software that allows users to enter data into a screen via a browser and then the software saves the data to a database. I’m getting all my data direct from that database (MSSQL) and one field I get is a job description. I’m using a tab as a column separator to store columns of data in a dict called datadict, and then store the datadict in a dataframe, and I wonder if the our software allows the user to enter a tab and save it in the database.

I will have to check for that.

There are other free-form text entry fields our database has that I’m using as well so I may have to check those for an embedded tab as well, which might cause me to make an extra column in the dataframe.

I’ll give it a look.

Just about anything could be in those text fields I get from the db, even hidden characters that someone copied and pasted from another app. They wouldn’t be able to see those.

onePythonUser · July 2, 2024, 2:14pm

You can potentially check where the discrepancy is being generated by comparing the length of the options.headersxl attribute and the data arrays that are being appended. For example,
you can include the following code within the for loop as a simple check:

if len(options.headersxl) != len(tlist):
    print('The source of the discrepancy is: \n', tlist)

You can then view the contents of the data array that is being appended and determine their validity.