How to convert all nan to empty string '' in pandas.read_excel()

System:

  • Windows 2025 v24H2 with Python 3.14
  • Pandas 2.3.3.

I’m trying to read an XLSX spreadsheet with pandas and am having a fit with empty cells that come back as nan. Here’s my code, which worked in Python 3.11.5. I’d like all nan values converted to an empty string ''.

I’m trying to get 3 fields: Firstname, Lastname, and Degree, then concat them together. But this gets an error when one field is nan. So I’m assuming I have to change all nan to an empty string first.

    try: 
        df = pandas.read_excel(options.inputfile, na_values='', usecols='A:W') # This no longer works.
        df.fillna('', inplace=True) # Replace nan with empty string.
    except PermissionError: 
        print(f"{procname} ERROR: File '{options.inputfile}' might be opened. Please close it then rerun program")
        sys.exit()
    except Exception as e:
        print(f"{procname} ERROR opening spreadsheet {options.inputfile}. {e}")
        sys.exit()

# A bit later
    # .str.strip() returns a new Series, so assign it back.
    df['Firstname'] = df['Firstname'].str.strip()
    df['Lastname'] = df['Lastname'].str.strip()
    df['Degree'] = df['Degree'].str.strip()

This line df.fillna('', inplace=True) is something new I tried but gives me the error: "FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '' has dtype incompatible with float64, please explicitly cast to a compatible dtype first."

I’m not sure how to fix this. Would someone mind helping me out? Surely others have dealt with nan in spreadsheets.

Thank you!

That’s not an error, it’s a warning (FutureWarning). In a future version of pandas, it will result in an error, but not in the version you’re using. Did you get the result you were expecting?

The reason for the warning is that your data frame (or at least some of its columns) is being stored as numeric. Trying to replace nan (a numeric value) with an empty string (a string value) may require changing the type of the affected columns. This warning is telling you (a) that this auto-conversion happened and (b) that pandas will stop auto-converting in the future.

There are a few ways to resolve the warning. In order to suggest the most helpful one, we need to know what you are trying to do and why you want to replace nan with the empty string.

(More specifically: do you want everything to be stored as a string, or do you want a mix of numbers and strings in your data frame?)
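To make the dtype issue concrete, here's a minimal sketch (the Score column name is just an example, not from your spreadsheet):

```python
import numpy as np
import pandas as pd

# A float64 column, like the ones read_excel produces for numeric data
df = pd.DataFrame({"Score": [1.0, np.nan]})

# fillna("") directly on this column triggers the FutureWarning, because ""
# is incompatible with float64. Casting to object first avoids it:
df["Score"] = df["Score"].astype(object).fillna("")

print(df["Score"].tolist())  # [1.0, '']
```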

Edit: re-reading your post makes clear you want everything as a string. I will make a suggestion shortly.

I’m reading a spreadsheet row by row. For each row I get the data from these columns: Firstname, Lastname, Degree. Then I concat them into a key like this:

# The "row" variable has one row from the dataframe.
fname = row['Firstname']
lname = row['Lastname']
degree = row['Degree']
curkey = fname + " " + lname + " " + degree

But there are some blank rows in the spreadsheet. When I find a blank row, fname contains nan, which I think is of type float. When I concat that to make curkey I get an error.
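For what it's worth, that error can be reproduced outside pandas; this sketch assumes the blank cell comes back as np.nan:

```python
import numpy as np

fname = np.nan  # what a blank cell comes back as: a float

try:
    curkey = fname + " " + "Smith"
except TypeError as e:
    # unsupported operand type(s) for +: 'float' and 'str'
    print(e)
```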

I would suggest changing your code to filter out/drop the empty rows before processing.

The easiest would be to change your code to:

        df = pandas.read_excel(options.inputfile, usecols='A:W')
        df = df.dropna(how="all", axis="index")  # removes all empty rows

Otherwise, if those blank rows are important for other reasons, you should replace the nans only in the columns you expect to be strings:

        df = pandas.read_excel(options.inputfile, usecols='A:W')
        df["Firstname"] = df["Firstname"].fillna("")
        df["Lastname"] = df["Lastname"].fillna("")
        df["Degree"] = df["Degree"].fillna("")

Side note, I removed your use of the na_values keyword argument, as, per the docs, that provides a list of values that should be turned into nan values.
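For reference, na_values adds strings that should be parsed as missing, which is the opposite of what you wanted. A self-contained sketch (using read_csv, which shares this keyword with read_excel, and a made-up "missing" token):

```python
from io import StringIO

import pandas as pd

csv = "Name\nAnn\nmissing\n"

# Without na_values, "missing" would stay a plain string; with it,
# pandas parses that cell as NaN.
df = pd.read_csv(StringIO(csv), na_values=["missing"])

print(df["Name"].isna().tolist())  # [False, True]
```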

1 Like

Thanks. I’m pretty sure the blank rows must stay in there as I’m modifying the spreadsheet and returning it to the sender.

I have solved this I believe.

# First change column to string type which will convert nan to 'nan'. 
    # Change types to str.
    df['Firstname'] = df['Firstname'].astype(str)
    df['Lastname'] = df['Lastname'].astype(str)
    df['Degree'] = df['Degree'].astype(str)

# Later check for 'nan' in 2 columns. If they are both 'nan' (string) then it's a blank row, skip the blank row.
        if (row['Firstname'] == 'nan') and (row['Lastname'] == 'nan'):
            outarr.append('') # Append blank row
            continue # Skip blank row, go to top of loop
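For the record, a minimal sketch of what astype(str) does to a missing value:

```python
import numpy as np
import pandas as pd

# astype(str) stringifies every element, including NaN, which becomes 'nan'
s = pd.Series([np.nan, "Ann"]).astype(str)

print(s.tolist())  # ['nan', 'Ann']
```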

That should also work, so long as you never encounter someone named “nan nan” :-). If you run into issues like that, you could combine fillna with astype:
df["Firstname"] = df["Firstname"].fillna("").astype(str)

An alternative approach would be to leave the nan values as is and use pd.isna() to replace your current check for "nan", but I find using astype to enforce your assumptions about the type of each column to more robust than needing to include a nan-check before every string operation.
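A minimal sketch of the pd.isna check, using a plain dict to stand in for a row:

```python
import numpy as np
import pandas as pd

row = {"Firstname": np.nan, "Lastname": np.nan}  # stand-in for a blank row

# pd.isna works on the raw nan values, no astype(str) needed
if pd.isna(row["Firstname"]) and pd.isna(row["Lastname"]):
    print("blank row")
```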

1 Like

For those who need to do fillna() right after the dataframe is read here is what did not work, and what works.

    try:
        df = pandas.read_excel(filename, usecols='A:H')
        #df.fillna('', inplace=True) # Does not work
        df = df.fillna('') # This works.
    except Exception as e:
        print(f"{procname} ERROR-read_excel: {e}")
        sys.exit(1)

You are officially my hero 8^)

Thanks - this saved me a LOT of mucking around with the datatypes on a simple Excel file import to a DB.

1 Like

It’s often a good idea to tell Pandas explicitly which columns are meant to contain which data type, using dtype=. See pandas.read_excel — pandas 3.0.0 documentation, which includes an example of declaring different data types for different columns.

In this case it looks like you have 3 columns containing string data. To ensure that you’re getting the data types you expect, you can tell Pandas explicitly to use those types when loading in the data:

import sys
import pandas as pd

if __name__ == "__main__":
    filename = sys.argv[1]
    data = pd.read_excel(
        filename,
        usecols="A:H",
        dtype={
            "Firstname": "str",
            "Lastname": "str",
            "Degree": "str",
        },
    )

It’s likely that Pandas inferred this data type automatically, but it rarely hurts to specify. Note that "str" isn’t available in Pandas 2.x; for that, use "string" (which is almost the same).
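As a minimal sketch of the nullable "string" dtype (the values here are made up):

```python
import pandas as pd

# With dtype="string", missing values become pd.NA instead of float nan
s = pd.Series(["Ann", None], dtype="string")

print(s.isna().tolist())      # [False, True]
print(s.fillna("").tolist())  # ['Ann', '']
```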

Now in addition to the above, processing data row-by-row is usually not the best pattern. Sometimes it’s unavoidable, but often people do it when they just don’t realize there’s a better way. This matters because iterating over rows with .iterrows or .apply can do weird things with dtypes, because each row is converted to a pd.Series, and that gets messy when column dtypes are heterogeneous and/or contain null data. That seems to be the situation here.

So in this case you can do:

import sys
import pandas as pd

def load_data(filename: str) -> pd.DataFrame:
    return pd.read_excel(
        filename,
        usecols="A:H",
        dtype={
            "Firstname": "string",
            "Lastname": "string",
            "Degree": "string",
        },
    )

def make_lookup_key(data: pd.DataFrame) -> pd.Series:
    firstname = data["Firstname"].fillna("")
    lastname = data["Lastname"].fillna("")
    degree = data["Degree"].fillna("")
    return firstname + " " + lastname + " " + degree

if __name__ == "__main__":
    filename = sys.argv[1]
    data = load_data(filename)
    data["lookup_key"] = make_lookup_key(data)

And if I really cannot avoid iterating row-wise, I find that almost always I am pulling out individual scalar values from each row, which is what it looks like you’re doing here. In that case, I prefer one of the two patterns below.

Pattern 1 is to iterate over row numbers, or better yet over the data frame “index” (i.e. row labels):

# OK
for n in range(len(data)):
    firstname = data["Firstname"].iat[n]
    lastname = data["Lastname"].iat[n]

# Better
for ix in data.index:
    firstname = data.at[ix, "Firstname"]
    lastname = data.at[ix, "Lastname"]

Note the use of .at and .iat instead of .loc and .iloc here, to indicate that the result is always scalar (a single value) and not a series or data frame.

Pattern 2 is to use .itertuples, which tends to reduce dtype weirdness:

for row in data.itertuples():
    firstname = row.Firstname
    lastname = row.Lastname

Admittedly I’m not sure how well that works with null data.
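For what it's worth, a quick sketch with a made-up column suggests missing values come through itertuples as NaN, so pd.isna still applies:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Firstname": ["Ann", np.nan]})

names = []
for row in df.itertuples():
    # the missing value arrives as a float NaN; guard with pd.isna
    names.append("" if pd.isna(row.Firstname) else row.Firstname)

print(names)  # ['Ann', '']
```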

The only time I ever deal with rows-as-series is when I’m using .apply(…, axis=1). In that situation, I figure it out on a case-by-case basis. I very rarely need it.

Finally, note that it’s usually a good idea to avoid inplace=True unless you know you have a good reason to use it, e.g. you are writing performance-sensitive code and you’ve systematically measured a meaningful difference.

That’s a good idea in general. However, we get spreadsheets from customers with all kinds of garbage in the cells, so that won’t work for us. Cells could have: embedded CRLF, embedded tabs, accented characters, and that’s just what we’ve found so far.

The odd thing is the accented characters do not show up when the file is opened in Excel. But my python program finds them.

I know this isn’t at all related to your original question, but look into text encoding! It’s how files represent characters as sequences of bytes. If two programs - for example, Excel and your Python script - read the same file but decode the bytes with different text encodings, it will appear with differences like this.

Recent Python versions decode text using UTF-8 by default. Excel or the original creator of the file perhaps used a different one.
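A minimal sketch of the mismatch, using "café" as an example:

```python
# The same bytes decoded two ways give different text
data = "café".encode("utf-8")   # b'caf\xc3\xa9'

print(data.decode("utf-8"))    # café
print(data.decode("latin-1"))  # cafÃ© (classic mojibake)
```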