For loop in dataframe in pandas

Hello,

I have a problem with a “for loop” using a dataframe in pandas, hope somebody can help with that.

I have the following dataframe in a csv file:

,forename,surname,gender,age,100m,200m,400m,800m,1500m
0,Migdalia,Parrish,F,18,11.08,29.0,59.41,122.05,259.11
1,Valerie,Lee,F,10,17.23,46.0,100.02,232.64,480.95
2,John,Debnam,M,17,10.81,25.89,50.6,110.29,232.39
3,Roy,Miller,M,10,19.18,46.74,95.32,201.14,430.27
4,Aida,Aumiller,F,11,15.3,41.83,81.06,189.03,394.9
5,Marcia,Brown,F,19,11.13,24.62,57.59,119.13,256.37
6,Harry,Knows,M,16,12.39,25.94,49.67,106.56,237.14
7,Barry,Lennon,M,14,11.15,23.56,46.46,110.89,230.49
8,Lilia,Armstrong,F,13,8.84,25.09,59.54,128.95,258.47
9,Johnny,Casey,M,15,9.65,22.67,49.46,112.85,233.87
10,Donald,Taylor,M,15,11.74,22.42,49.22,114.62,224.63
11,Martha,Woods,F,14,9.01,24.34,55.25,118.8,254.87
12,Diane,Lauria,F,15,8.99,27.92,54.79,119.89,249.21
13,Yvonne,Pumphrey,F,16,8.84,27.29,57.63,123.13,247.41
14,Betty,Stephenson,F,14,11.04,28.73,59.05,126.29,256.44
15,Lilia,Armstrong,F,12,11.31,34.43,74.28,150.05,321.07

And I have to create a main function that calls another function that, using a “for loop”, retrieves the athlete with the fastest time for each age (10,11,12,13,14,15,16) for a specific gender (e.g. ‘F’) and distance (e.g. ‘100m’).

For example:
Input:
fastest_athletes = find_fastest_athletes(df,“100m”,“F”,[10,11,12,13,14,15,16])
Output:
{
10: {’forename’: 'Valerie’, 'surname’: 'Lee’, 'time’: '17.23’},
11: {’forename’: 'Aida’, 'surname’: 'Aumiller’, 'time’: '15.3’},
12: {’forename’: 'Lilia’, 'surname’: 'Armstrong’, 'time’: '11.31’},
13: {’forename’: 'Lilia’, 'surname’: 'Armstrong’, 'time’: '8.84’},
14: {’forename’: 'Martha’, 'surname’: 'Woods’, 'time’: '9.01’},
15: {’forename’: 'Diane’, 'surname’: 'Lauria’, 'time’: '8.99’},
16: {’forename’: 'Yvonne’, 'surname’: 'Pumphrey’, 'time’: '8.84’}
}

I did the following code:

# Function with the for loop
def find_fastest_athletes(df,distance,gender,ages):
  for age in range(10,16):
    fastest_athletes = df[(df["gender"] == gender) & (df["age"] == age)]
    fastest_athletes_sorted = fastest_athletes.sort_values(distance,ascending=True)
    fastest_athletes_value = fastest_athletes_sorted.iloc[[0]][["forename","surname","100m"]]
    athletes_data = fastest_athletes_value.to_string(index=False, header=False).split('  ')
    athletes_data_dict = {
        'forename': athletes_data[0].strip(),
        'surname': athletes_data[1],
        'time': float(athletes_data[2])
        }
  return athletes_data_dict
  
# Main function
def main(filename='athletes.csv'):
    df = pd.read_csv(filename, index_col=0)
    df['100m'] = df['100m'].astype(float)
    print(find_fastest_athletes(df,'100m','F',[10,11,12,13,14,15,16]))
    return
   
if __name__ == "__main__":
  main()   

With my coding I get as output ONLY the fastest athlete for the last age (16 year’s old) and not ALL the fastest athletes for each age (10,11,12,13,14,15,16), why is that?

Also how can I add the age at the beginning of each line?

Hey @Giorgina99 so pandas DataFrames and for loops go together like water and oil—that is to say, not at all. Certainly, they generally work, for small problems like this at least, but they lead to code that is way more complicated (and thus prone to bugs) than it needs to be, and have far, far worse performance for larger data sets (say, a CSV of all the track results for all the schools in your country for a full year, which is not an uncommon thing when you’re working with data science problems in the real world—I regularly deal with datasets with millions of rows, and what I do would be considered “small” by some people). Instead, you want to use native numpy/pandas operations (vectorization) as much as possible, which leads to both simpler and much faster code. In fact, I was able to help a colleague speed up her data processing script by around one million times, mostly by replacing for loops with native pandas operations, and in fewer lines of code besides. It definitely takes a little more learning on the front end, but pays huge dividends in the long term.

Anyway, with that out of the way, let’s take a look at your code. Given your problem is that only the dictionary of the last athlete is returned, well, why might that be? if we look at your find_fastest_athlete function, we can see that a new dictionary is assigned to the athletes_data_dict variable with each run through the loop, replacing the previous one. Then, once the loop is finished, athletes_data_dict gets returned. Since a new dictionary is reassigned to the same variable each time, only the last value (in this case, the one for 16) gets returned.

If you want to return a dictionary with the age as keys and dictionaries of the athlete data as the values, you’ll need to create the dictionary above the for age in range(10, 16) for loop (e.g. all_ages = {} right below def find_fastest_athletes) and then after you create atheletes_data_dict inside the loop, add it to the dictionary with its age as key, e.g. all_ages[age] = athletes_data_dict, and return all_ages instead of the individual athletes_data_dict.

However, at least on my version of Pandas (1.3.4), regardless of that change, running your code fails with an IndexError when attempting to index athletes data into the atheletes_data_dict, because .split(' ') only returns a single string since the values in to_string are separated by one space, not two. Changing .split(' ') (two spaces) to .split(' ') (one space) gets it working for me, though I’m utterly baffled as to why you’re doing any of this at all. Furthermore, there’s another problem—you’re hardcoding 100m as the distance value when you are getting the specified columns in find_fastest_athletes, which will return the wrong times if distance is anything other than "100m" (which it happens to be in your example); that should be distance instead. And, range doesn’t return the last value, so your range(10, 16) should instead be range(10, 17) if you want it to return the values from [10, 16] inclusive (also, note your ages actually range all the way up to 19).

Beyond that, selecting the first value in the sorted df and the specified columns, then outputting this to a string, then splitting the string, then manually indexing the strings to get the values, then turning time back into a float, packing that all back into a dictionary, which (as the error I received demonstrates), is not only far more complicated than it needs to be, but also incredibly fragile—what if someone’s name has a space? What if you change the order of the columns? What if pandas changes how it formats the string? etc, etc…).

As a first step to simplifying this, we can simply eliminate the two lines before athletes_data_dict and just use iloc (to get the zeroth row) and loc (to get the column you want by name) on fastest_athletes_sorted directly when getting the values for athletes_data_dict. So, that would look like this:

        athletes_data_dict = {
            'forename': fastest_athletes_sorted.iloc[0].loc["forename"],
            'surname': fastest_athletes_sorted.iloc[0].loc["surname"],
            'time': fastest_athletes_sorted.iloc[0].loc[distance]
            }

However, we can make this even simpler (and faster), by removing the fastest_athletes_sorted line entirely and using df.idxmin to get the row with the minimum time on the specified column directly:

        fastest_athlete = fastest_athletes.loc[fastest_athletes[distance].idxmin()]
        athletes_data_dict = {
            'forename': fastest_athlete.loc["forename"],
            'surname': fastest_athlete.loc["surname"],
            'time': fastest_athlete.loc[distance]
            }

Yet, we can do better still. We’re manually constructing a dictionary from a DataFrame, which already has the columns and values we need. Instead of the previous athetes_data_dict line, we can use Pandas’ Series.to_dict method and a bit of column name replacement:

        athletes_data_dict = fastest_athlete[
            ["forename", "surname", distance]].rename({distance: "times"}).to_dict()

However, we can do better yet. We can filter athletes by gender only once, rather than every loop, since that’s a constant throughout the function. Furthermore, ages is passed into the function, but it is never used; presumably, you want to use it in your loop, and it itself can be a range instead of manually listing every age. So now, for our function, we have

def find_fastest_athletes(df, distance, gender, ages):
    all_ages = {}
    df_gender = df.loc[df["gender"] == gender]
    for age in ages:
        df_age = df_gender.loc[df_gender["age"] == age]
        fastest_athlete = df_age.loc[df_age[distance].idxmin()]
        all_ages[age] = fastest_athlete[
            ["forename", "surname", distance]].rename({distance: "times"}).to_dict()
    return all_ages

and our invocation,

    print(find_fastest_athletes(df, '100m', 'F', range(0, 17)))

However, here, we finally come full circle. We can make our remaining code much simpler and faster by not using a for loop at all. Instead, while we could use Pandas’ groupby function to group by age and find the fastest runners in each group, its even simpler to use a function you already did before—sort_values. All we need to do is select the columns with the ages and genders that we want, sort the values by the specified distance column, drop duplicates with the same age (i.e. those with higher times than the top runner), set the index and sort by it, and finally convert to a dict as we did before, setting the orientation to produce the format you want. This gets us:

def find_fastest_athletes(df, distance, gender, ages):
    df_byage = df.loc[(df["gender"] == gender) & df["age"].isin(ages)].sort_values(
        distance).drop_duplicates("age").set_index("age").sort_index()
    return df_byage[["forename", "surname", distance]].rename(
        columns={distance: "times"}).to_dict(orient="index")

This ends up being only 4 lines instead of the 13 of the original (corrected) function, runs about 5 times faster, gracefully handles cases where there is no runners of a specified age, and is far more robust to the various issues mentioned above.

There’s also some other things you can improve here. The code is mixing indent levels, with four spaces in some areas and two in others, which can be very confusing and is thus highly discouraged. instead, I suggest just using four spaces everywhere, which is the standard in Python. In main, the return is unnecessary, as in Python functions always return implicitly once their body is complete. So is df['100m'] = df['100m'].astype(float), as the Pandas already converts the column to a float when reading from the CSV given it contains non-integer numeric values.

Finally, when we put it all together, the fixed and simplified equivalent of your code above (with FILE_PATH as a top-level constant with the path to the file) that produces your desired results is:

def find_fastest_athletes(df, distance, gender, ages):
    df_byage = df.loc[(df["gender"] == gender) & df["age"].isin(ages)].sort_values(
        distance).drop_duplicates("age").set_index("age").sort_index()
    return df_byage[["forename", "surname", distance]].rename(
        columns={distance: "times"}).to_dict(orient="index")

def main(filename=FILE_PATH):
    df = pd.read_csv(filename, index_col=0)
    print(find_fastest_athletes(df, '100m', 'F', range(10, 17)))

if __name__ == "__main__":
    main()

Hopefully this all is helpful! I’m happy to answer any further questions. Cheers!

Hey @CAM-Gerlach,

Wow that’s really exhaustive and I really appreciate your detailed explanations, it helped me a lot to understand where my fault was, and, above all, how I can make the coding smoother.

Thanks again for your helpful insight,
Giorgina

1 Like

Thanks so much for your kind words, @Giorgina99 , and I’m really glad it was helpful! I hope it didn’t discourage you too much about your code; I wanted to guide you through some ways to improve your work further as opposed to just pointing out the few small bugs it. Cheers, and best of luck!

1 Like