Itertools groupby should be renamed to chunk_by

The behavior of the groupby() function from itertools is documented as returning “sub-iterators grouped by value of key(v)”. It is further clarified in the detailed documentation that the function “Make[s] an iterator that returns consecutive keys and groups from the iterable.”

The behavior of this function is valid, documented, and useful. It performs a meaningful task and uses minimal memory.

Despite that, I argue that its current name is misleading.

A Python developer unfamiliar with the nuances of this function might understandably assume that it groups all items with a matching key from the input list. They might attempt something like this and be confounded by the result:

>>> from itertools import groupby
>>> data = ['aa', 'bbb', 'cc']
>>> grouped = {k: list(v) for k, v in groupby(data, key=len)}
>>> print(grouped)
{2: ['cc'], 3: ['bbb']}

This is especially surprising because it will perform exactly how the naive developer expects it to if data is already sorted. When it is not sorted, values are silently omitted from the resulting dictionary.
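
To illustrate the sorted case, here is a minimal sketch using the same data as above. Sorting by the same key function first makes equal-key items adjacent, so each key shows up in exactly one run. The variable name by_len is only for illustration.

```python
from itertools import groupby

data = ['aa', 'bbb', 'cc']

# Sorting by the same key makes equal-key items adjacent,
# so groupby sees each key in exactly one consecutive run.
by_len = sorted(data, key=len)
grouped = {k: list(v) for k, v in groupby(by_len, key=len)}
print(grouped)  # {2: ['aa', 'cc'], 3: ['bbb']}
```

Note that this requires the whole input to fit in memory and to be finite, which the unsorted call does not.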

While the above code is correct and performs accurately to the documented behavior, I argue that the well-understood behavior for a “group by” operation involves aggregating all the elements of the input sequence, not just matching sequential elements.

This understanding is shared by many ubiquitous languages and technologies:

  • SQL’s GROUP BY expression
  • JavaScript’s Object.groupBy()
  • C#'s System.Linq.Enumerable.GroupBy
  • Java’s Collectors.groupingBy
  • Ruby’s Enumerable group_by
  • R’s dplyr library’s group_by
  • Swift’s Dictionary init(grouping:by:)
  • Microsoft Excel’s GROUPBY

(I unfortunately can’t post links to documentation for the above because of forum restrictions)

Conversely, the itertools groupby behavior only appears by that name in Haskell:

  • Haskell’s Data.List.groupBy shares the same name and behaves similarly to groupby in itertools.

I propose the following in accordance with PEP 5:

  1. Officially deprecate the groupby function in itertools
  2. Add a function to itertools named chunk_by() that behaves identically to the current groupby function.
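
As a rough sketch of what these two steps could look like in user code today: chunk_by and the warning-emitting groupby shim below are hypothetical names written for illustration, not anything that exists in itertools.

```python
import itertools
import warnings

# Hypothetical new name: the same object, so the behavior is identical.
chunk_by = itertools.groupby


def groupby(iterable, key=None):
    """Hypothetical deprecated shim that warns, then forwards to chunk_by."""
    warnings.warn(
        "groupby is deprecated; use chunk_by instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return chunk_by(iterable, key)


# Consecutive runs are reported exactly as itertools.groupby reports them.
runs = [(k, list(g)) for k, g in chunk_by(['aa', 'bbb', 'cc'], key=len)]
print(runs)  # [(2, ['aa']), (3, ['bbb']), (2, ['cc'])]
```

Existing callers of the shim would keep working and only see a DeprecationWarning.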

C++23's ranges library provides std::views::chunk_by for this same behavior, so the name is not unprecedented.

As it stands, the naming of groupby in itertools is antithetical to Python’s ideals of readability and unsurprising conventions. It behaves differently from the majority of existing technology in a surprising and silent way that can trip up even seasoned developers. A more carefully chosen name would eliminate much of the uncertainty around its behavior and describe the code more clearly.

I appreciate that groupby has been around for many, many years and this probably has a snowball’s chance in hell of ever happening, but as it is currently named it remains a surprising pitfall.

The proposed name change doesn’t seem like a big improvement to me: groupby and chunk_by sound to me like basically the same thing. But I’ve never had a need to compare them to other languages.

And you’re right, there’s not much chance of this change being accepted. The churn isn’t worth whatever small improvement would be gained. For my own work, I’d have to change many dozens or maybe even 100 or so files. That would not make my customers happy: money spent for no business value.

“When it is not sorted, values are /silently omitted/ from the resulting dictionary.”

Data is only being omitted because you’re putting the result into a dict. The groupby function itself is not omitting anything:

>>> [(k, list(v)) for k, v in groupby(data, key=len)]
[(2, ['aa']), (3, ['bbb']), (2, ['cc'])]

To be clear, what is the output you are expecting from your code example?

Again, to be 100% clear here, I’m not saying the behavior of this function should change at all. I understand exactly what it’s doing and why the example code does not work as intended. Nothing about this function is wrong. The example I gave was to illustrate how this function can be easily mis-used because of the common understanding of what the “group by” behavior implies. My only proposal here is to adjust the name of this function to be less ambiguous.

The need to sort before calling it was very jarring to me, since pandas groupby operations were my first exposure to the idea.

From what I’ve learned in pandas, groupby operations can only be performed on data that supports rich comparisons and can be hashed. I discovered this by storing instances of data classes that I hadn’t configured properly in DataFrame columns and hitting errors on both requirements.

Maybe someone familiar with SQL can chime in on that version.

With all this said, I can see the point the OP is making, although it’s far too late to change the behavior.

I do wonder why it was decided not to sort the iterable internally in groupby. Maybe because that would be too eager.

I assume it’s an efficiency thing: if the data is already sorted, no need to sort it again. For example, I use groupby on the results of database queries containing millions of rows. I sort them on the database side, so no need to load them all into client-side memory to sort them.

itertools.groupby seems like an appropriate name for the functionality. Maybe an appropriate name for the behavior you want would be collections.groupby?
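
For comparison, the aggregating behavior could be sketched roughly like this. The name group_all is hypothetical (no collections.groupby exists), and the sketch assumes keys are hashable and that all items fit in memory.

```python
from collections import defaultdict


def group_all(iterable, key):
    """Hypothetical helper: collect every item under its key.

    Keys appear in first-seen order; unlike itertools.groupby,
    the input does not need to be sorted.
    """
    groups = defaultdict(list)
    for item in iterable:
        groups[key(item)].append(item)
    return dict(groups)


print(group_all(['aa', 'bbb', 'cc'], key=len))  # {2: ['aa', 'cc'], 3: ['bbb']}
```

The trade-off is that this consumes the entire input before returning anything, whereas itertools.groupby yields each run lazily.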

(Btw I think I’ve seen a previous topic/issue about this where someone (Raymond?) showed a rough timeline, i.e., when each of several programming languages got a “groupby”, and Python was among the first, but I can’t find it … Anyone?)

It’s because groupby takes an iterator. Not all iterators can be sorted (infinite ones, for example) and for some that can, you may not want to consume them all (if you only want the first group, for example).
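
A small sketch of the infinite case, using itertools.count as a stand-in for any endless source: the first group can be read without exhausting, or sorting, the input.

```python
from itertools import count, groupby

# count() never ends, so it could not be sorted up front.
groups = groupby(count(), key=lambda n: n // 3)

# Pull only the first run; groupby stops reading once the key changes.
key, first_group = next(groups)
first_items = list(first_group)
print(key, first_items)  # 0 [0, 1, 2]
```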

from itertools import group_by as chunk_by

Done.

Sometimes the solution really is that simple, huh?

I was thinking of posting the same thing in jest. (Except groupby instead of group_by of course).

I acknowledge the mental shift from data-science groupbys to this one, but it’s just too late in the game to change the name. Way too many tools expect itertools.groupby to behave the way it does. I could see a case for aliasing groupby to another name, but not for deprecating it.