It’s often a good idea to tell Pandas explicitly which data types each column should contain, using dtype=. See the pandas.read_excel documentation, which includes an example of declaring different data types for different columns.
In this case it looks like you have 3 columns containing string data. To ensure that you’re getting the data types you expect, you can tell Pandas explicitly to use those types when loading in the data:
import sys

import pandas as pd

if __name__ == "__main__":
    filename = sys.argv[1]
    data = pd.read_excel(
        filename,
        usecols="A:H",
        dtype={
            "Firstname": "str",
            "Lastname": "str",
            "Degree": "str",
        },
    )
It’s likely that Pandas inferred this data type automatically, but it rarely hurts to specify. Note that the "str" dtype is new in Pandas 3.0; on 2.x, use "string" instead (which is almost the same).
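To see the difference the nullable "string" dtype makes, here’s a quick sketch with a small in-memory frame (made-up names standing in for your spreadsheet data):

```python
import pandas as pd

# Hypothetical data standing in for the spreadsheet column.
df = pd.DataFrame({"Firstname": ["Ada", None, "Grace"]})

# Without a declared dtype, a column with missing values is plain
# "object". Converting to "string" keeps missing entries as pd.NA
# in a dedicated string container.
s = df["Firstname"].astype("string")
print(s.dtype)          # string
print(s.isna().sum())   # 1
```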
Now, in addition to the above: processing data row by row is usually not the best pattern. Sometimes it’s unavoidable, but often people reach for it simply because they don’t realize there’s a better way. It matters because iterating over rows with .iterrows or .apply can do strange things to dtypes: each row is converted to a pd.Series, which gets messy when column dtypes are heterogeneous and/or contain null data. That seems to be the situation here.
So in this case you can do:
import sys

import pandas as pd


def load_data(filename: str) -> pd.DataFrame:
    return pd.read_excel(
        filename,
        usecols="A:H",
        dtype={
            "Firstname": "string",
            "Lastname": "string",
            "Degree": "string",
        },
    )


def make_lookup_key(data: pd.DataFrame) -> pd.Series:
    firstname = data["Firstname"].fillna("")
    lastname = data["Lastname"].fillna("")
    degree = data["Degree"].fillna("")
    return firstname + " " + lastname + " " + degree


if __name__ == "__main__":
    filename = sys.argv[1]
    data = load_data(filename)
    data["lookup_key"] = make_lookup_key(data)
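To show that the vectorized version handles missing values without any row loop, here’s the same make_lookup_key run against a small in-memory frame (hypothetical names, since I don’t have your spreadsheet):

```python
import pandas as pd


def make_lookup_key(data: pd.DataFrame) -> pd.Series:
    firstname = data["Firstname"].fillna("")
    lastname = data["Lastname"].fillna("")
    degree = data["Degree"].fillna("")
    return firstname + " " + lastname + " " + degree


# Hypothetical rows, including a missing Degree.
df = pd.DataFrame(
    {
        "Firstname": pd.array(["Ada", "Grace"], dtype="string"),
        "Lastname": pd.array(["Lovelace", "Hopper"], dtype="string"),
        "Degree": pd.array([None, "PhD"], dtype="string"),
    }
)

# The null Degree becomes an empty string, and the whole column is
# built in one vectorized operation.
print(make_lookup_key(df).tolist())  # ['Ada Lovelace ', 'Grace Hopper PhD']
```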
And when I really can’t avoid iterating row-wise, I find that I’m almost always pulling individual scalar values out of each row, which is what it looks like you’re doing here. In that case, I prefer one of the two patterns below.
Pattern 1 is to iterate over row numbers, or better yet over the data frame “index” (i.e. row labels):
# OK
for n in range(len(data)):
    firstname = data["Firstname"].iat[n]
    lastname = data["Lastname"].iat[n]

# Better
for ix in data.index:
    firstname = data.at[ix, "Firstname"]
    lastname = data.at[ix, "Lastname"]
Note the use of .at and .iat instead of .loc and .iloc here, to indicate that the result is always scalar (a single value) and not a series or data frame.
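One concrete way to see the difference (with made-up data): .loc can hand back a whole Series when index labels repeat, whereas .at is specifically for single scalar access:

```python
import pandas as pd

# Hypothetical frame with a deliberately duplicated index label.
df = pd.DataFrame({"Firstname": ["Ada", "Grace"]}, index=[0, 0])

# With repeated labels, .loc returns a Series, not a scalar.
print(type(df.loc[0, "Firstname"]))  # <class 'pandas.core.series.Series'>

# .at is for one scalar at a time; with a unique index it always
# returns a single value.
df2 = pd.DataFrame({"Firstname": ["Ada", "Grace"]})
print(df2.at[1, "Firstname"])  # Grace
```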
Pattern 2 is to use .itertuples, which tends to reduce dtype weirdness:
for row in data.itertuples():
    firstname = row.Firstname
    lastname = row.Lastname
Admittedly I’m not sure how well that works with null data.
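From a quick check with made-up data, missing entries come through the tuples as a null scalar (pd.NA for the "string" dtype), which pd.isna can detect:

```python
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({"Firstname": pd.array(["Ada", None], dtype="string")})

names = []
for row in df.itertuples():
    # pd.isna works on the scalar pulled out of the tuple.
    names.append("" if pd.isna(row.Firstname) else row.Firstname)
print(names)  # ['Ada', '']
```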
The only time I ever deal with rows-as-series is when I’m using .apply(…, axis=1). In that situation, I figure it out on a case-by-case basis. I very rarely need it.
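As a sketch of the dtype weirdness mentioned above: with heterogeneous columns, each row handed to .apply(…, axis=1) is a Series that has been upcast to object, so per-column dtypes are lost inside the function:

```python
import pandas as pd

# Hypothetical frame with mixed column dtypes.
df = pd.DataFrame({"n": [1, 2], "name": ["Ada", "Grace"]})

# Each row arrives as a Series whose dtype is object, because an
# int64 and a string can't share a more specific dtype.
row_dtypes = df.apply(lambda row: str(row.dtype), axis=1)
print(row_dtypes.tolist())  # ['object', 'object']
```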
Finally, note that it’s usually a good idea to avoid inplace=True unless you know you have a good reason to use it, e.g. you are writing performance-sensitive code and you’ve systematically measured a meaningful difference.
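For comparison, here’s what the non-inplace style looks like (a small sketch with hypothetical data); the two produce the same result, but plain reassignment is easier to chain and reason about:

```python
import pandas as pd

# Hypothetical frame with a missing Degree.
df = pd.DataFrame({"Degree": ["PhD", None]})

# Preferred: fillna returns a new frame; bind it to a name.
cleaned = df.fillna({"Degree": ""})

# Equivalent result with inplace=True, mutating df instead.
df.fillna({"Degree": ""}, inplace=True)
print(cleaned.equals(df))  # True
```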