Many Python libraries related to data science support the concept of boolean indexing (see "Indexing on ndarrays" in the NumPy v1.22 Manual). I would like to propose a limited version of this concept that could be useful to regular Python programmers.
Sometimes one doesn't want to (or can't, because of business requirements) import libraries like pandas or numpy with all their bells and whistles, but still wants a performant and concise way of filtering lists.
Idea
My idea is to allow indexing lists with lists of booleans of the same length.
I wrote a short sample patch specifically for lists. I suppose the concept could be extended to other sequences, but I wanted to see if the idea gets any traction first.
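For reference, the closest equivalent in today's standard library is itertools.compress; the proposed syntax would effectively build the same selection into list indexing:

```python
from itertools import compress

data = [10, 20, 30, 40]
mask = [True, False, True, False]

# Today: an explicit helper is needed.
filtered = list(compress(data, mask))  # [10, 30]

# Under the proposal, data[mask] would return the same new list.
```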
Benchmark
Using the included patch I wrote short benchmarks:
“getitem”
# most pythonic version I could think of
for _ in range(1_000_000):
    # data in "columnar format"
    col_1 = list(range(200))
    col_2 = list(range(100, 300))
    col_3_has_some_feature = [True, False] * 100
    # filtering data
    foo = [x for x, y in zip(col_1, col_3_has_some_feature) if y]
    bar = [x for x, y in zip(col_2, col_3_has_some_feature) if y]
running time (ubuntu 20.04, i7-4700hq): 32.80s
# proposed syntax with my patch
for _ in range(1_000_000):
    # data in "columnar format"
    col_1 = list(range(200))
    col_2 = list(range(100, 300))
    col_3_has_some_feature = [True, False] * 100
    # filtering data
    foo = col_1[col_3_has_some_feature]
    bar = col_2[col_3_has_some_feature]
running time (ubuntu 20.04, i7-4700hq): 12.05s
“setitem”
Note that these two versions are not exactly equivalent: my version allocates memory for one extra list in the last line, since getitem indexing allocates a new list, just as slicing does.
# python without proposed syntax
for _ in range(1_000_000):
    # data in "columnar format"
    col_1 = list(range(200))
    col_2 = list(range(100, 300))
    col_3_has_some_feature = [True, False] * 100
    # rewriting data where feature exists
    for i, x in enumerate(col_3_has_some_feature):
        if x:
            col_1[i] = col_2[i]
running time (ubuntu 20.04, i7-4700hq): 50.14s
# proposed syntax with my patch
for _ in range(1_000_000):
    # data in "columnar format"
    col_1 = list(range(200))
    col_2 = list(range(100, 300))
    col_3_has_some_feature = [True, False] * 100
    # rewriting data where feature exists
    col_1[col_3_has_some_feature] = col_2[col_3_has_some_feature]
running time (ubuntu 20.04, i7-4700hq): 11.92s
My code is a proof of concept, so I expect the final version would be slower due to safety checks; even so, the conciseness and performance benefits look interesting to me.
If there is interest in this concept I could spend some time on implementation. Let me know what you think.
Maybe allow slice to be subclassed, or imitated somehow?
That is, the slice object would have a method (__slice__?) returning an ordered set of indices, and that method could be overridden. Natively it would be something like return [k for k in range(self.start, self.stop, self.step)]. Objects implementing that method, or possibly only subclasses of slice, would then be handled that way by the native types.
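A rough sketch of what such a protocol could look like. Everything here is hypothetical: the __slice__ name, the BoolMask class, and the getitem helper standing in for what a native type would do internally (note that today slice itself cannot be subclassed, which is why duck typing is used):

```python
class BoolMask:
    """Hypothetical index object implementing the suggested __slice__ protocol."""
    def __init__(self, mask):
        self.mask = mask

    def __slice__(self, length):
        # Return the ordered list of indices selected by the mask.
        return [i for i, keep in enumerate(self.mask) if keep]

def getitem(seq, key):
    """Sketch of how a native sequence type could consult the protocol."""
    if hasattr(key, "__slice__"):
        return [seq[i] for i in key.__slice__(len(seq))]
    return seq[key]

data = list(range(5))
getitem(data, BoolMask([True, False, True, False, True]))  # → [0, 2, 4]
```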
Since bool is a subclass of int I think we’re on very thin ice if you want a[[1, 3, 5]] to have one meaning and a[[True, True, False]] a different meaning. The latter would be expected to behave the same as a[[1, 1, 0]]. Even if numpy/pandas make this distinction, this is pretty fundamental to Python’s bool – we cannot easily stop it from subclassing int without breaking vast amounts of user code.
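The ambiguity is easy to demonstrate, since True and False literally compare equal to 1 and 0:

```python
# bool is a subclass of int, so True and False behave as 1 and 0.
assert isinstance(True, int)
assert True == 1 and False == 0

a = ["a", "b", "c"]
# Today, a[True] is simply a[1]:
assert a[True] == "b"

# So a mask like [True, True, False] is indistinguishable, by both type
# and value, from the index list [1, 1, 0] used for fancy indexing.
assert [True, True, False] == [1, 1, 0]
```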
That’s a good point. As a person who heavily uses numpy/pandas daily, I never thought of it this way. Yet the speed benefits seem pretty significant to me. What about creating a separate function (or a list method) that works like itertools.compress, but for assignments?
Is there a benefit to making it a method on mutable sequences? That would imply that each mutable sequence implementation would have to implement its own version of this. If it was a helper function (taking a mutable sequence as argument) generic support would be easier obtained. The helper function could still have a special case for lists for performance, if needed.
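Such a generic helper could look like the sketch below. The name assign_where is purely illustrative, and the body is the naive pure-Python version (a real package could special-case list for speed, as suggested above):

```python
def assign_where(target, mask, source):
    """Hypothetical in-place analogue of itertools.compress: copy
    source[i] into target[i] wherever mask[i] is true. Works on any
    mutable sequence, not just list."""
    if not (len(target) == len(mask) == len(source)):
        raise ValueError("all three sequences must have the same length")
    for i, flag in enumerate(mask):
        if flag:
            target[i] = source[i]

col_1 = [0, 1, 2, 3]
col_2 = [10, 11, 12, 13]
assign_where(col_1, [True, False, True, False], col_2)
# col_1 is now [10, 1, 12, 3]
```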
I do doubt that this would get much use – numpy arrays are just different from lists – so I am not excited about it, and I don’t see much excitement in this thread either.
Personally, I’m not sure I see much of a use case that isn’t already handled by Numpy/pandas/xarray in the types of real-world use that would need this, and they already have their own solutions (which I use heavily). But as a start, you could write a helper function along the lines of what @guido suggested, publish that as a package on PyPI, and see what sort of uptake/feedback you get.