The behavior of the groupby() function from itertools is documented as returning “sub-iterators grouped by value of key(v)”. It is further clarified in the detailed documentation that the function “Make[s] an iterator that returns consecutive keys and groups from the iterable.”
The behavior of this function is valid, documented, and useful. It performs a meaningful task and uses minimal memory.
Despite that, I argue that its current name is misleading.
A Python developer unfamiliar with the nuances of this function might understandably assume that it groups all items with a matching key from the input iterable. They might then attempt something like this and be confounded by the result:
>>> from itertools import groupby
>>> data = ['aa', 'bbb', 'cc']
>>> grouped = {k: list(v) for k, v in groupby(data, key=len)}
>>> print(grouped)
{2: ['cc'], 3: ['bbb']}
This is especially surprising because the code behaves exactly as the naive developer expects if data is already sorted by the grouping key. When it is not, values are silently omitted from the resulting dictionary: each later run of a key overwrites the earlier run stored under that key.
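To illustrate the sorted case, here is a minimal sketch that reuses the data list from above and sorts it by the same key before grouping. The variable name by_len is my own, chosen only for this example.

```python
from itertools import groupby

data = ['aa', 'bbb', 'cc']

# Sorting by the grouping key first makes all equal-key items adjacent,
# so each key appears in exactly one consecutive run.
by_len = sorted(data, key=len)

grouped = {k: list(v) for k, v in groupby(by_len, key=len)}
print(grouped)  # {2: ['aa', 'cc'], 3: ['bbb']}
```

Note that the sort key has to match the grouping key. The original list is already in alphabetical order, which does not help here.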
While the above code is correct and behaves exactly as documented, I argue that the well-understood meaning of a “group by” operation is aggregating all the elements of the input sequence that share a key, not just adjacent matching elements.
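For comparison, the aggregating behavior described above can be sketched in a few lines. The helper name group_all is hypothetical and is not part of itertools or any standard library module. Unlike groupby, this approach has to hold every item in memory.

```python
from collections import defaultdict


def group_all(iterable, key):
    """Hypothetical helper: collect every item under its key, regardless of order."""
    groups = defaultdict(list)
    for item in iterable:
        groups[key(item)].append(item)
    return dict(groups)


print(group_all(['aa', 'bbb', 'cc'], key=len))  # {2: ['aa', 'cc'], 3: ['bbb']}
```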
This understanding is shared by many ubiquitous languages and technologies:
- SQL’s GROUP BY clause
- JavaScript’s Object.groupBy()
- C#'s System.Linq.Enumerable.GroupBy
- Java’s Collectors.groupingBy
- Ruby’s Enumerable group_by
- R’s dplyr library’s group_by
- Swift’s Dictionary init(grouping:by:)
- Microsoft Excel’s GROUPBY
(I unfortunately can’t post links to documentation for the above because of forum restrictions)
Conversely, the itertools groupby behavior only appears by that name in Haskell:
- Haskell’s Data.List.groupBy shares the same name and behaves similarly to groupby in itertools.
I propose the following in accordance with PEP 5:
- Officially deprecate the groupby function in itertools
- Add a function to itertools named chunk_by() that behaves identically to the current groupby function.
C++'s ranges library std::ranges::views uses the chunk_by terminology to describe this same behavior, so this name is not unprecedented.
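As a rough sketch of what the transition could look like at the Python level: the wrapper below is purely illustrative. chunk_by does not exist in itertools today, the warning text is my own wording, and the real groupby is implemented in C, so an actual change would look different.

```python
import itertools
import warnings


def chunk_by(iterable, key=None):
    """Hypothetical new name: delegates to the existing itertools.groupby."""
    return itertools.groupby(iterable, key)


def groupby(iterable, key=None):
    """Hypothetical deprecated alias that warns, then behaves as before."""
    warnings.warn(
        "groupby() is deprecated; use chunk_by() instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return chunk_by(iterable, key)


# The new name makes the consecutive-run behavior easier to anticipate.
print([(k, list(g)) for k, g in chunk_by(['aa', 'bbb', 'cc'], key=len)])
# [(2, ['aa']), (3, ['bbb']), (2, ['cc'])]
```

Existing code would keep working during the deprecation period and only see a DeprecationWarning.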
As it stands, the naming of groupby in itertools is antithetical to Python’s ideals of readability and unsurprising conventions. It behaves differently from the majority of existing technology, in a surprising and silent way that can trip up even seasoned developers. A more carefully chosen name would eliminate much of the uncertainty around its behavior and make code that uses it easier to read correctly.
I appreciate that groupby has been around for many, many years and that this probably has a snowball’s chance in hell of ever happening, but as it currently stands it remains a surprising pitfall.