Add basic boolean indexing to list

Many Python libraries related to data science support the concept of boolean indexing (see "Indexing on ndarrays", NumPy v1.22 Manual). I would like to propose a limited version of this concept that could be useful to regular Python programmers.

Sometimes one doesn’t want to (or can’t, because of business requirements) import libraries like pandas or NumPy with all their bells and whistles, but still wants a performant and concise way of filtering lists.

Idea

My idea is to allow indexing lists using lists of booleans of the same length.

To give an example:

a = [1, 2, 3]
b = [True, False, True]

print(a[b])  # output: [1, 3]

The same would also work for assignment. The assigned list’s length would need to equal the number of True values in the index:

a = [1, 2, 3]
b = [True, False, True]
a[b] = [-1, -2]

print(a)  # output: [-1, 2, -2]

I wrote a short sample patch specifically for lists. I suppose the concept could be extended to other sequences, but I wanted to see whether the idea gets any traction first.
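To make the intended semantics concrete, here is a rough pure-Python sketch of the behavior (the actual patch is in C; the class name and error messages below are my own illustration, not part of the patch):

```python
class BoolIndexList(list):
    """Illustrative list subclass emulating the proposed boolean indexing."""

    def __getitem__(self, key):
        if isinstance(key, list) and all(isinstance(k, bool) for k in key):
            if len(key) != len(self):
                raise IndexError("boolean index length must match list length")
            # Return a new list, analogous to slicing
            return [x for x, keep in zip(self, key) if keep]
        return super().__getitem__(key)

    def __setitem__(self, key, value):
        if isinstance(key, list) and all(isinstance(k, bool) for k in key):
            if len(key) != len(self):
                raise IndexError("boolean index length must match list length")
            positions = [i for i, keep in enumerate(key) if keep]
            if len(positions) != len(value):
                raise ValueError("value length must equal number of True entries")
            for i, v in zip(positions, value):
                super().__setitem__(i, v)
            return
        super().__setitem__(key, value)


a = BoolIndexList([1, 2, 3])
b = [True, False, True]
print(a[b])     # [1, 3]
a[b] = [-1, -2]
print(a)        # [-1, 2, -2]
```

This is only meant to pin down the semantics (length checks, new list on read, in-place write); the real implementation would live in the C-level `list` type.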

Benchmark

Using the included patch, I wrote two short benchmarks:

  1. “getitem”

    # most Pythonic version I could think of

    for _ in range(1_000_000):
        # data in "columnar format"
        col_1 = list(range(200))
        col_2 = list(range(100, 300))
        col_3_has_some_feature = [True, False] * 100

        # filtering data
        foo = [x for x, y in zip(col_1, col_3_has_some_feature) if y]
        bar = [x for x, y in zip(col_2, col_3_has_some_feature) if y]

    running time (Ubuntu 20.04, i7-4700HQ): 32.80 s

    # proposed syntax with my patch

    for _ in range(1_000_000):
        # data in "columnar format"
        col_1 = list(range(200))
        col_2 = list(range(100, 300))
        col_3_has_some_feature = [True, False] * 100

        # filtering data
        foo = col_1[col_3_has_some_feature]
        bar = col_2[col_3_has_some_feature]

    running time (Ubuntu 20.04, i7-4700HQ): 12.05 s

  2. “setitem”

    These two are not exactly equivalent, as my version allocates memory for one extra list in the last line (getitem indexing allocates a new list, similar to slicing).

    # Python without proposed syntax

    for _ in range(1_000_000):
        # data in "columnar format"
        col_1 = list(range(200))
        col_2 = list(range(100, 300))
        col_3_has_some_feature = [True, False] * 100

        # rewriting data where the feature exists
        for i, x in enumerate(col_3_has_some_feature):
            if x:
                col_1[i] = col_2[i]

    running time (Ubuntu 20.04, i7-4700HQ): 50.14 s

    # proposed syntax with my patch

    for _ in range(1_000_000):
        # data in "columnar format"
        col_1 = list(range(200))
        col_2 = list(range(100, 300))
        col_3_has_some_feature = [True, False] * 100

        # rewriting data where the feature exists
        col_1[col_3_has_some_feature] = col_2[col_3_has_some_feature]

    running time (Ubuntu 20.04, i7-4700HQ): 11.92 s
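For anyone who wants to reproduce the shape of the comparison, here is a minimal `timeit`-based sketch (my numbers above came from standalone scripts, not this exact harness, and the scale here is smaller):

```python
import timeit


def filter_columns():
    """One iteration of the 'getitem' benchmark: build columns and filter."""
    col_1 = list(range(200))
    col_3_has_some_feature = [True, False] * 100
    # today's list-comprehension equivalent of col_1[col_3_has_some_feature]
    return [x for x, y in zip(col_1, col_3_has_some_feature) if y]


elapsed = timeit.timeit(filter_columns, number=1_000)
print(f"{elapsed:.3f} s for 1,000 iterations")
```

Absolute numbers will of course vary by machine; the point is only the relative gap between the comprehension and the patched indexing.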

My code is a proof of concept, so I expect the final version would be slower due to safety checks, but to me the conciseness and performance benefits look interesting.

If there is interest in this concept, I could spend some time on the implementation. Let me know what you think.

Doesn’t itertools.compress() already provide this?


This one is nice, thanks for providing the link. It also has great performance, almost identical to my custom solution.

However, I don’t think it works for the scenario where indexing is done on the left side of an assignment, i.e. replacing only some elements in place.
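To spell that out: `compress()` covers the read case well, but for the write case you still need an explicit loop today (a small sketch, using the same column names as the benchmarks above):

```python
from itertools import compress

col_1 = list(range(200))
col_2 = list(range(100, 300))
mask = [True, False] * 100

# Read case: compress() yields items where the selector is true,
# equivalent to the proposed col_1[mask].
foo = list(compress(col_1, mask))
assert foo == [x for x, keep in zip(col_1, mask) if keep]

# Write case: no compress()-style one-liner exists; replacing only
# the selected elements still requires an index-based loop.
for i, keep in enumerate(mask):
    if keep:
        col_1[i] = col_2[i]
```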

Python already has a similar notation for slices:

a = [1, 2, 3, 4]
a[2:6] = [0, 0, 0, 0]
print(a)  # output: [1, 2, 0, 0, 0, 0]

so I thought extending it would feel quite natural, even for people who are not daily pandas/NumPy/etc. users.