Multiple loop in comprehension

stex · March 23, 2023, 10:22am

Hello,
I have code that works fine, but it’s too slow to handle the millions of lines in my dataframe. So I’m trying to use a list comprehension, but maybe it’s not the right solution. Any function that could speed it up would be greatly appreciated.
My initial dataframe holds the permutations of a list of people over the list of workdays needed on that day:

df = pd.DataFrame(np.array(list(itertools.permutations(Jeff,len(Lbo)))), columns=Lbo)

It is then converted to a dataframe which lists the options by name in the columns:

P = pd.DataFrame(columns = Leff, index=df.index)
    for k in range(0,len(df)):
        for column in P.columns:
            L = df.iloc[k].values.tolist()
            for l in range(0,len(L)):
                if column == L[l]:
                    P[column].iloc[k] = df.columns[l]

Now I want to get rid of the loops as much as possible, and I tried:

for k in range(0,len(df)):
        P.iloc[k]= [df.columns[l].iloc[k] if any(column == df.iloc[k].values.tolist()[l]) else pd.NA for l in range(0,len(df.columns)) for column in P.columns]

But it raises the error ‘bool’ object is not iterable…

Can anyone help me with an elegant solution to my problem?

CAM-Gerlach · March 23, 2023, 4:39pm

You can think of a list comprehension as just a slightly specialized version of a for-loop that handles creating and appending elements to a new list for you. It’ll be a bit cleaner and perform a bit better, but from the perspective of NumPy/Pandas, it’s all the same thing, and is going to perform orders of magnitude worse than native NumPy/Pandas vectorized operations on larger arrays/dataframes, especially when you’re nesting it 3 (!) levels deep like here.
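To make that equivalence concrete, here is a minimal sketch (the list `names` is made up for illustration). It also reproduces the reported error message, which most likely comes from calling any() on the result of a single comparison, which is a plain bool rather than an iterable:

```python
# Hypothetical data, only for illustration
names = ["Amina", "Bob", "Cristina"]

# Explicit for loop: create a list and append to it by hand
upper_loop = []
for name in names:
    upper_loop.append(name.upper())

# The equivalent list comprehension: same iteration, just more compact
upper_comp = [name.upper() for name in names]

# Comparing two scalars gives a single bool, and any() expects an iterable,
# so this raises TypeError
try:
    any("Bob" == names[1])
except TypeError as exc:
    error_message = str(exc)

print(upper_loop == upper_comp)
print(error_message)
```

Dropping the any() call would avoid that particular error, but it would not make the comprehension meaningfully faster than the loops.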

Instead of a for loop/comprehension (which, at a high level, are basically different spellings of the very same thing for your purposes), you want to use native vectorized NumPy/Pandas operations to do what you want, for at least as many layers of loops as you can (innermost first). These do what you want all in one go, which is both cleaner and far faster than manually iterating (sometimes millions of times).
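As a rough illustration of what "vectorized" means here, the sketch below compares a Python-level loop with the equivalent single Pandas call. The small `frame` and the name being searched for are assumptions made up for the example, not taken from the original code:

```python
import pandas as pd

# Hypothetical miniature schedule: one row per option, one column per day
frame = pd.DataFrame(
    {
        "Day 1": ["Amina", "Bob", "Bob"],
        "Day 2": ["Bob", "Amina", "Cristina"],
    }
)

# Python-level iteration: one comparison per row, driven by the interpreter
loop_mask = [frame["Day 1"].iloc[k] == "Amina" for k in range(len(frame))]

# Vectorized: the whole column is compared in a single call
vector_mask = frame["Day 1"] == "Amina"

# The same idea applied to the whole frame: for each row, find the day on
# which "Amina" works, or NaN if she does not appear in that row
is_amina = frame.eq("Amina")
amina_day = is_amina.idxmax(axis=1).where(is_amina.any(axis=1))

print(amina_day)
```

The loop and the vectorized comparison give the same answer; the difference is that the latter pushes the iteration down into compiled code.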

In this case, your example above is not reproducible or complete: it references variables Jeff, Lbo, and Leff (typo?), none of which are defined, and their names are not very descriptive, so it is difficult for any reader to know what they represent. Also, the second code block fails to parse with a SyntaxError, because the first for line contains a spurious level of indentation. You should always make sure you can copy and paste your examples into a new file and they actually work, or we will not be able to use your code without manually trying to fix it and guess what you meant, which is not great for either you or us.

However, I’m just going to assume that Jeff is the list of people and Lbo is the list of workdays (the code is no different if you swap them), and that Leff is a misspelling of Jeff (or vice versa), as it isn’t obvious how the code is intended to work otherwise. I’m also going to assume len(Jeff) > len(Lbo): if they were equal, you could simply swap the names in the initial call creating the dataframe to get a dataframe identical to the one your for loop produces, and if len(Jeff) were smaller, your code would fail with an error. So, for test purposes, I’ll assume:

import itertools
import numpy as np
import pandas as pd

Jeff = ["Amina", "Bob", "Cristina", "Deshawn"]
Lbo = [f"Day {n}" for n in range(1, 4)]
df = pd.DataFrame(itertools.permutations(Jeff, len(Lbo)), columns=Lbo)

(Note that I eliminated the redundant list and np.array calls.)

So, we have:

>>> print(df)
       Day 1     Day 2     Day 3
0      Amina       Bob  Cristina
1      Amina       Bob   Deshawn
2      Amina  Cristina       Bob
3      Amina  Cristina   Deshawn
4      Amina   Deshawn       Bob
5      Amina   Deshawn  Cristina
6        Bob     Amina  Cristina
7        Bob     Amina   Deshawn
8        Bob  Cristina     Amina
9        Bob  Cristina   Deshawn
10       Bob   Deshawn     Amina
11       Bob   Deshawn  Cristina
12  Cristina     Amina       Bob
13  Cristina     Amina   Deshawn
14  Cristina       Bob     Amina
15  Cristina       Bob   Deshawn
16  Cristina   Deshawn     Amina
17  Cristina   Deshawn       Bob
18   Deshawn     Amina       Bob
19   Deshawn     Amina  Cristina
20   Deshawn       Bob     Amina
21   Deshawn       Bob  Cristina
22   Deshawn  Cristina     Amina
23   Deshawn  Cristina       Bob

Running your corrected block of example code:

P = pd.DataFrame(columns = Jeff, index=df.index)
for k in range(0,len(df)):
    for column in P.columns:
        L = df.iloc[k].values.tolist()
        for l in range(0,len(L)):
            if column == L[l]:
                P[column].iloc[k] = df.columns[l]

results in

>>> print(P)
    Amina    Bob Cristina Deshawn
0   Day 1  Day 2    Day 3     NaN
1   Day 1  Day 2      NaN   Day 3
2   Day 1  Day 3    Day 2     NaN
3   Day 1    NaN    Day 2   Day 3
4   Day 1  Day 3      NaN   Day 2
5   Day 1    NaN    Day 3   Day 2
6   Day 2  Day 1    Day 3     NaN
7   Day 2  Day 1      NaN   Day 3
8   Day 3  Day 1    Day 2     NaN
9     NaN  Day 1    Day 2   Day 3
10  Day 3  Day 1      NaN   Day 2
11    NaN  Day 1    Day 3   Day 2
12  Day 2  Day 3    Day 1     NaN
13  Day 2    NaN    Day 1   Day 3
14  Day 3  Day 2    Day 1     NaN
15    NaN  Day 2    Day 1   Day 3
16  Day 3    NaN    Day 1   Day 2
17    NaN  Day 3    Day 1   Day 2
18  Day 2  Day 3      NaN   Day 1
19  Day 2    NaN    Day 3   Day 1
20  Day 3  Day 2      NaN   Day 1
21    NaN  Day 2    Day 3   Day 1
22  Day 3    NaN    Day 2   Day 1
23    NaN  Day 3    Day 2   Day 1

Now, if you know a priori that df is just the permutations of Jeff over Lbo (which, at least going off what you’ve stated, you do), you can simply construct P directly without df, knowing only Jeff and Lbo by just calling the original itertools.permutations with swapped arguments. The only complexity is just manually padding Lbo with NaNs to the proper length:

Lbo_padded = Lbo + [np.nan] * max(0, len(Jeff) - len(Lbo))
P2 = pd.DataFrame(itertools.permutations(Lbo_padded, len(Jeff)), columns=Jeff)

You can see you get the same result as your code above (at least, ignoring the row order):

P, P2 = (df.sort_values(by=list(df.columns.values)).reset_index(drop=True) for df in (P, P2))
print(P.equals(P2))
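One caveat worth checking, as far as I can tell: the padding trick gives exactly the right number of rows only when at most one NaN is padded. When len(Jeff) exceeds len(Lbo) by two or more, itertools.permutations treats the NaNs as distinct items, so every row comes out several times. A minimal sketch, assuming four people and only two days (the names `people` and `days` are hypothetical), with drop_duplicates as one possible fix:

```python
import itertools
import math

import numpy as np
import pandas as pd

# Hypothetical inputs: four people but only two days, so two NaNs get padded
people = ["Amina", "Bob", "Cristina", "Deshawn"]
days = ["Day 1", "Day 2"]

days_padded = days + [np.nan] * max(0, len(people) - len(days))
raw = pd.DataFrame(itertools.permutations(days_padded, len(people)), columns=people)

# The two NaNs are interchangeable, so each arrangement appears more than once;
# drop_duplicates treats NaN as equal to NaN and removes the repeats
unique = raw.drop_duplicates(ignore_index=True)

# Number of rows the original loop-based approach would produce
expected_rows = math.perm(len(people), len(days))

print(len(raw), len(unique), expected_rows)
```

For large inputs the duplicated intermediate frame can get big, so generating only the distinct rows in the first place may be preferable; this is just the smallest change to the approach above.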

Even on the small example dataframe above, this direct approach is fully 100x faster than your original for-loop solution (369 µs vs 36.4 ms, not counting the dataframe creation time for the original solution):

%timeit original_solution()
36.4 ms ± 719 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit new_solution()
369 µs ± 466 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
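The bodies of original_solution and new_solution are not shown in this thread. A plausible reconstruction, for anyone who wants to repeat the measurement outside IPython, might look like the sketch below. It is an assumption rather than the exact code that was timed: the loop is lightly tidied and uses .at instead of the chained P[column].iloc[k] assignment, so the numbers will not match those quoted.

```python
import itertools
import timeit

import numpy as np
import pandas as pd

Jeff = ["Amina", "Bob", "Cristina", "Deshawn"]
Lbo = [f"Day {n}" for n in range(1, 4)]
df = pd.DataFrame(itertools.permutations(Jeff, len(Lbo)), columns=Lbo)


def original_solution():
    # Assumed body: the nested loops from the question, writing with .at
    P = pd.DataFrame(columns=Jeff, index=df.index)
    for k in range(len(df)):
        row = df.iloc[k].tolist()
        for column in P.columns:
            for l, name in enumerate(row):
                if column == name:
                    P.at[k, column] = df.columns[l]
    return P


def new_solution():
    # Assumed body: direct construction from the padded list of days
    Lbo_padded = Lbo + [np.nan] * max(0, len(Jeff) - len(Lbo))
    return pd.DataFrame(itertools.permutations(Lbo_padded, len(Jeff)), columns=Jeff)


# Plain-Python stand-in for %timeit: average over a few calls
loops = 5
t_original = timeit.timeit(original_solution, number=loops) / loops
t_new = timeit.timeit(new_solution, number=loops) / loops
print(f"original: {t_original * 1e3:.2f} ms, new: {t_new * 1e3:.2f} ms per call")
```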

If we scale this up a modest amount to 6 names by 5 days (720 rows x 5 columns):

Jeff = [f"Name {n}" for n in range(6)]
Lbo = [f"Day {n}" for n in range(5)]

Then the direct creation solution is over 1000x faster (1.52 s vs 1.07 ms):

%timeit original_solution()
1.52 s ± 8.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit new_solution()
1.07 ms ± 6.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Likewise, on a dataframe of 8 names x 7 days (40k rows x 7 columns), it’s nearly 10000x faster:

%timeit -r 1 -n 1 original_solution()
2min 56s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit new_solution()
20.7 ms ± 72.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If for whatever reason this does not satisfy the (unstated) constraints of the problem, there are other possible solutions, but you’ll need to specify those constraints first.
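As one example of such an alternative: if df could not be assumed to be the full set of permutations, a reshape with melt and pivot is one vectorized way to get from df to P. This is only a sketch under the assumption that each name appears at most once per row (pivot raises an error otherwise); the names `long_form` and `P3` are made up for the example.

```python
import itertools

import pandas as pd

Jeff = ["Amina", "Bob", "Cristina", "Deshawn"]
Lbo = [f"Day {n}" for n in range(1, 4)]
df = pd.DataFrame(itertools.permutations(Jeff, len(Lbo)), columns=Lbo)

# Long form: one line per (row number, day, person) combination
long_form = df.reset_index().melt(id_vars="index", var_name="day", value_name="person")

# Wide form: one column per person, holding the day they work in that row
P3 = (
    long_form.pivot(index="index", columns="person", values="day")
    .reindex(columns=Jeff)  # restore the original column order
    .rename_axis(index=None, columns=None)  # drop the axis labels pivot adds
)

print(P3.head())
```

Unlike the direct construction, this keeps the row order of df, at the cost of building an intermediate long-form frame.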

stex · March 24, 2023, 6:45am

Thanks a lot for your very comprehensive response.
You guessed right about the missing part of my code. Lbo is actually something like morning shift n°1, morning shift n°2, evening shift… The dataframe gets big once I iterate through the week. And the difference between Leff and Jeff is that Leff is the whole team, while Jeff contains only the people potentially working on that day (that is handled by another function). I’ll just add the names missing from Jeff to the resulting DataFrame, with their respective holidays or sick days.
As for the rest, padding Lbo and doing the permutation with NaN values works amazingly well compared to my for loops.
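For later readers, adding the absent team members could be sketched roughly as follows. Everything here is assumed for illustration: the team lists, the tiny stand-in for the permutation result, and the `absences` mapping are not from the actual code.

```python
import numpy as np
import pandas as pd

# Hypothetical whole team (Leff) and the people available that day (Jeff)
Leff = ["Amina", "Bob", "Cristina", "Deshawn", "Elif"]
Jeff = ["Amina", "Bob", "Cristina"]

# Stand-in for the permutation result, limited to two rows for brevity
P2 = pd.DataFrame(
    [["Day 1", "Day 2", np.nan], ["Day 2", np.nan, "Day 1"]],
    columns=Jeff,
)

# Hypothetical reasons for absence, keyed by person
absences = {"Deshawn": "Holiday", "Elif": "Sick"}

# reindex adds the missing people as all-NaN columns in team order
full = P2.reindex(columns=Leff)

# Fill each absent person's column with the reason for their absence
for person, reason in absences.items():
    full[person] = reason

print(full)
```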
