Split the pandas dataframe by a column value

Hi

I have about 20 big pandas data frames, each with 100,000 rows, like this

a     b
1     0.3
1     0.5
......
2     3.4
......
2     0.4
......
1000  0.3
1000  4.5

where a = 1 for the first 100 rows, a = 2 for the second 100 rows, etc. Now I want to divide the dataframes into smaller ones. I tried

N = 1000
for i in range(1, 1 + N):
    df_sub[i] = df[df['a'] == i]

(In reality, there is one more loop around the above, since I have about 20 such data frames; the above is just to keep the presentation simple.)
However, it is super inefficient, since the whole dataframe has to be searched for each i. I wonder if there is any efficient way to do this, such as the groupby function?

Thanks!!

J

Indeed, you’ve astutely identified that the for-loop solution is not going to be very efficient, especially for large dataframes like yours: every iteration scans the entire dataframe, so the total work is roughly quadratic (a full scan for each of the 1000 groups), on top of the inherent inefficiency of a Python for-loop compared to built-in vectorized pandas functions. Of course, if you really want to split the dataframes, there is always going to be some inefficiency in operating separately on separate objects vs. working on, e.g., a single dataframe or groupby object, which you should really try to do if possible.

On that note, it seems this might be an XY problem: it’s not clear to me why or how you need the dataframes to be separated. It’s usually more efficient and powerful to use groupby or another construct for split-apply-combine workflows rather than entirely separate dataframes, and if you really do want to split them, the construct you organize them into instead depends on what specifically you want to do; one common option is sketched below.
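For instance, if you want the pieces individually addressable by their group value, a dict built in a single groupby pass is a common choice. A minimal sketch (with toy column names standing in for yours):

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [0.3, 0.5, 3.4, 0.4]})

# Iterating over a groupby yields (key, sub-dataframe) pairs, so one
# pass maps each distinct value of "a" to its own sub-dataframe.
subframes = {key: sub for key, sub in df.groupby("a")}

subframes[2]  # just the rows where a == 2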

Also, in the future, please make sure you post a complete, minimal reproducible example that we can run; it wasn’t too difficult to reproduce your example df, but your example solution is not complete, as it uses df_sub without defining it, and it’s not clear what it is intended to be (a NumPy array? another dataframe? a Python list?) or how you created it.

In any case, let’s work with what we have, starting from a small dataframe based on your example:

import pandas as pd

df = pd.DataFrame(
    {"a": [1, 1, 2, 2, 3, 3], "b": [0.3, 0.5, 3.4, 0.4, 0.3, 4.5]})

To get your example working, I’ll tweak it to just use a list, and stick it into a function:

def split_initial(df, n_groups, colname="a"):
    df_sub = []
    for i in range(1, n_groups + 1):
        df_sub.append(df[df[colname] == i])
    return df_sub

We can simplify this a bit to use a comprehension, and to iterate directly over the unique column values rather than manually hardcoding a number of values under a certain set of assumptions:

def split_refined(df, colname="a"):
    return [df[df[colname] == value] for value in df[colname].unique()]

Of course, this is still inefficient, for the reasons previously mentioned. We can instead use groupby and fetch each group with get_group(), which should speed this up some:

def split_getgroup(df, colname="a"):
    gb = df.groupby(colname)
    return [gb.get_group(group) for group in gb.groups]

However, it can be simplified further, and likely made a good deal more efficient, by simply iterating on the groupby object directly and avoiding the lookup entirely:

def split_iterate(df, colname="a"):
    return [x for __, x in df.groupby(colname)]

Okay, let’s see how these perform, first on our small demo df above:

%timeit -n 100 -r 10 split_initial(df, 3)
1.77 ms ± 59.3 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit -n 100 -r 10 split_refined(df)
2.21 ms ± 108 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit -n 100 -r 10 split_getgroup(df)
2.41 ms ± 251 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit -n 100 -r 10 split_iterate(df)
987 µs ± 89.8 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

At this very small dataframe size, the time is dominated by the overhead of the function calls and object creation/destruction rather than the actual lookups. The first three versions are all similar, but the last is much faster, because (unlike the others) it avoids a per-group lookup on the df entirely (even an O(N) one), and instead simply splits it into groups.

To get more meaningful results, let’s write a function to generate a larger dataframe:

import numpy as np
from numpy.random import default_rng

def generate_test_data(n_groups, n_per_group):
    # Each of the n_groups values of "a" is repeated n_per_group times,
    # paired with uniform random values in "b".
    return pd.DataFrame({
        "a": np.repeat(np.arange(1, n_groups + 1), n_per_group),
        "b": default_rng().random(n_groups * n_per_group),
    })

Calling it with

df = generate_test_data(n_groups=100, n_per_group=10)

to get some test data of intermediate size, we can benchmark it again:

%timeit -n 10 -r 10 split_initial(df, 100)
60.2 ms ± 1.87 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

%timeit -n 10 -r 10 split_refined(df)
63.8 ms ± 2.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

%timeit -n 10 -r 10 split_getgroup(df)
40.2 ms ± 3.33 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

%timeit -n 10 -r 10 split_iterate(df)
5.65 ms ± 566 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

Here, we see the original and refined versions doing about the same, the get_group() version showing a moderate speedup, and the version iterating directly on the groupby object being over 10 times faster than the original.

Finally, reproducing a dataframe the size of your actual data, or 100x larger than the previous:

df = generate_test_data(n_groups=1000, n_per_group=100)

We can run our benchmarks once again:

%timeit -n 1 -r 10 split_initial(df, 1000)
831 ms ± 21.4 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

%timeit -n 1 -r 10 split_refined(df)
1.06 s ± 121 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

%timeit -n 1 -r 10 split_getgroup(df)
347 ms ± 17.9 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

%timeit -n 1 -r 10 split_iterate(df)
56 ms ± 4.88 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

Here, we again see the initial and refined versions perform similarly, as expected, with the refined version again paying a bit of extra overhead for the df[colname].unique() call (outweighing the efficiency boost from the comprehension). The get_group() version sees a decent 2-3x improvement, due to the lookups occurring in the already-grouped groupby object rather than on the original df. The version iterating on the groupby object shows a large, nearly 20x improvement; on 20 such dataframes as you have, this operation will take only about 1 second rather than nearly 20.
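To make that concrete, your outer loop over the ~20 dataframes would look something like this (a minimal sketch; the list dfs is a hypothetical stand-in for however you actually store them):

# Hypothetical stand-in for the ~20 real dataframes.
dfs = [generate_test_data(n_groups=1000, n_per_group=100) for _ in range(20)]

# One groupby pass per dataframe.
all_subframes = [split_iterate(d) for d in dfs]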

I do note that, depending on what you actually want to do, this is still not terribly efficient in the grand scheme of things. Suppose you wanted to calculate the mean of b for each group: even with the most efficient variation of the above, doing it via the split dataframes would still take several seconds across your 20 dfs (and of course, much longer for more complex, realistic operations), whereas using groupby directly is much faster:

%timeit -n 1 -r 10 [sub_df["b"].mean() for sub_df in split_iterate(df)]
316 ms ± 10.8 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

%timeit -n 1 -r 10 df.groupby("a").mean()
6.55 ms ± 1.16 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

…Nearly 50x faster, in fact.
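The same goes for richer summaries: when the per-group operation can be expressed through groupby, you can often stay inside it with agg (or apply) rather than splitting first. A quick sketch of the idea:

# Several per-group statistics in a single groupby pass, no splitting needed.
summary = df.groupby("a")["b"].agg(["mean", "std", "max"])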
