Many Python libraries related to data science support the concept of boolean indexing (see "Indexing on ndarrays" in the NumPy v1.22 Manual). I would like to propose a limited version of this concept that could be useful to regular Python programmers.
Sometimes one doesn't want to (or can't, because of business requirements) import libraries like pandas or numpy with all their bells and whistles, but still wants a performant and concise way of filtering lists.
Idea
My idea is to allow indexing lists with lists of booleans of the same length.
I wrote a short sample patch specifically for lists. I suppose the concept could be extended to other sequences, but I wanted to see if the idea gets any traction first.
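For reference, the closest equivalent in today's standard library is itertools.compress; the proposed syntax would effectively build the same selection into list indexing:

```python
from itertools import compress

data = [10, 20, 30, 40]
mask = [True, False, True, False]

# Today: an explicit helper is needed.
filtered = list(compress(data, mask))  # [10, 30]

# Under the proposal, data[mask] would return the same new list.
```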
Benchmark
Using the included patch I wrote short benchmarks:
“getitem”
# most pythonic version I could think of
for _ in range(1_000_000):
    # data in "columnar format"
    col_1 = list(range(200))
    col_2 = list(range(100, 300))
    col_3_has_some_feature = [True, False] * 100
    # filtering data
    foo = [x for x, y in zip(col_1, col_3_has_some_feature) if y]
    bar = [x for x, y in zip(col_2, col_3_has_some_feature) if y]
running time (ubuntu 20.04, i7-4700hq): 32.80s
# proposed syntax with my patch
for _ in range(1_000_000):
    # data in "columnar format"
    col_1 = list(range(200))
    col_2 = list(range(100, 300))
    col_3_has_some_feature = [True, False] * 100
    # filtering data
    foo = col_1[col_3_has_some_feature]
    bar = col_2[col_3_has_some_feature]
running time (ubuntu 20.04, i7-4700hq): 12.05s
“setitem”
Note that these two versions are not exactly equivalent: my version allocates memory for one extra list in the last line, since getitem indexing allocates a new list, just as slicing does.
# python without proposed syntax
for _ in range(1_000_000):
    # data in "columnar format"
    col_1 = list(range(200))
    col_2 = list(range(100, 300))
    col_3_has_some_feature = [True, False] * 100
    # rewriting data where feature exists
    for i, x in enumerate(col_3_has_some_feature):
        if x:
            col_1[i] = col_2[i]
running time (ubuntu 20.04, i7-4700hq): 50.14s
# proposed syntax with my patch
for _ in range(1_000_000):
    # data in "columnar format"
    col_1 = list(range(200))
    col_2 = list(range(100, 300))
    col_3_has_some_feature = [True, False] * 100
    # rewriting data where feature exists
    col_1[col_3_has_some_feature] = col_2[col_3_has_some_feature]
running time (ubuntu 20.04, i7-4700hq): 11.92s
My code is a proof of concept, so I expect the final version would be slower due to safety checks; even so, the conciseness and performance benefits look interesting to me.
If there is interest in this concept I could spend some time on implementation. Let me know what you think.
Maybe allow slice to be subclassed, or imitated somehow?
That is, the slice object would have a method (__slice__?) returning an ordered set of indices, and that method could be overridden. Natively it would be something like return [k for k in range(self.start, self.stop, self.step)]. Objects implementing that method, or possibly only subclasses of slice, would then be handled that way by the native types.
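A rough sketch of what such a protocol could look like. Everything here is hypothetical: the __slice__ name, the BoolMask class, and the getitem helper standing in for what a native type would do internally (note that today slice itself cannot be subclassed, which is why duck typing is used):

```python
class BoolMask:
    """Hypothetical index object implementing the suggested __slice__ protocol."""
    def __init__(self, mask):
        self.mask = mask

    def __slice__(self, length):
        # Return the ordered list of indices selected by the mask.
        return [i for i, keep in enumerate(self.mask) if keep]

def getitem(seq, key):
    """Sketch of how a native sequence type could consult the protocol."""
    if hasattr(key, "__slice__"):
        return [seq[i] for i in key.__slice__(len(seq))]
    return seq[key]

data = list(range(5))
getitem(data, BoolMask([True, False, True, False, True]))  # → [0, 2, 4]
```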
Since bool is a subclass of int I think we’re on very thin ice if you want a[[1, 3, 5]] to have one meaning and a[[True, True, False]] a different meaning. The latter would be expected to behave the same as a[[1, 1, 0]]. Even if numpy/pandas make this distinction, this is pretty fundamental to Python’s bool – we cannot easily stop it from subclassing int without breaking vast amounts of user code.
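The ambiguity is easy to demonstrate, since True and False literally compare equal to 1 and 0:

```python
# bool is a subclass of int, so True and False behave as 1 and 0.
assert isinstance(True, int)
assert True == 1 and False == 0

a = ["a", "b", "c"]
# Today, a[True] is simply a[1]:
assert a[True] == "b"

# So a mask like [True, True, False] is indistinguishable, by both type
# and value, from the index list [1, 1, 0] used for fancy indexing.
assert [True, True, False] == [1, 1, 0]
```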
That’s a good point. As a person who heavily uses numpy/pandas daily, I never thought of it this way. Yet the speed benefits seem pretty significant to me. What about creating a separate function (or a list method) that works like itertools.compress, but for assignments?
Is there a benefit to making it a method on mutable sequences? That would imply that each mutable sequence implementation would have to implement its own version of this. If it was a helper function (taking a mutable sequence as argument) generic support would be easier obtained. The helper function could still have a special case for lists for performance, if needed.
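Such a generic helper could look like the sketch below. The name assign_where is purely illustrative, and the body is the naive pure-Python version (a real package could special-case list for speed, as suggested above):

```python
def assign_where(target, mask, source):
    """Hypothetical in-place analogue of itertools.compress: copy
    source[i] into target[i] wherever mask[i] is true. Works on any
    mutable sequence, not just list."""
    if not (len(target) == len(mask) == len(source)):
        raise ValueError("all three sequences must have the same length")
    for i, flag in enumerate(mask):
        if flag:
            target[i] = source[i]

col_1 = [0, 1, 2, 3]
col_2 = [10, 11, 12, 13]
assign_where(col_1, [True, False, True, False], col_2)
# col_1 is now [10, 1, 12, 3]
```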
I do doubt that this would get much use – numpy arrays are just different from lists – so I am not excited about it, and I don’t see much excitement in this thread either.
Personally, I’m not sure I see much of a use case that isn’t already handled by Numpy/pandas/xarray in the types of real-world use that would need this, and they already have their own solutions (which I use heavily). But as a start, you could write a helper function along the lines of what @guido suggested, publish that as a package on PyPI, and see what sort of uptake/feedback you get.