Many Python libraries related to data science support the concept of boolean indexing (see "Indexing on ndarrays" in the NumPy v1.22 manual). I would like to propose a limited version of this concept that could be useful to regular Python programmers.
Sometimes one doesn’t want to (or can’t, because of business requirements) import libraries like pandas or numpy with all their bells and whistles, but still wants a performant and concise way of filtering lists.
Idea
My idea is to allow indexing lists using lists of booleans of the same length.
To give an example:
a = [1, 2, 3]
b = [True, False, True]
print(a[b])  # output: [1, 3]
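For comparison, the closest equivalents available today are a list comprehension or itertools.compress; a sketch of what the proposed a[b] would return:

```python
from itertools import compress

a = [1, 2, 3]
b = [True, False, True]

# what the proposed a[b] would return, expressed with today's tools
filtered = list(compress(a, b))
print(filtered)  # [1, 3]
```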
The same would also work for assignments. The assigned list’s length would need to equal the number of True values in the index list:
a = [1, 2, 3]
b = [True, False, True]
a[b] = [-1, -2]
print(a)  # output: [-1, 2, -2]
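The assignment semantics, including the length checks a final version would need, can be sketched in pure Python today (the helper name mask_assign is hypothetical, just for illustration):

```python
def mask_assign(target, mask, values):
    """Assign values into target at positions where mask is True.

    A pure-Python sketch of the proposed ``a[b] = [...]`` semantics.
    """
    if len(mask) != len(target):
        raise IndexError("boolean index length must match list length")
    if sum(mask) != len(values):
        raise ValueError("value count must match number of True indices")
    it = iter(values)
    for i, flag in enumerate(mask):
        if flag:
            target[i] = next(it)

a = [1, 2, 3]
mask_assign(a, [True, False, True], [-1, -2])
print(a)  # [-1, 2, -2]
```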
I wrote a short sample patch specifically for lists. I guess the concept might be extensible to other sequences, but I wanted to see if the idea gets any traction first.
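Without the patch, the lookup side of the idea can be prototyped as a list subclass (BoolList is a hypothetical name; this is only a sketch of the intended behavior, not the C implementation):

```python
class BoolList(list):
    """Hypothetical prototype: a list accepting boolean-list indices.

    Only __getitem__ is shown; __setitem__ would follow the same pattern.
    """
    def __getitem__(self, index):
        if isinstance(index, list) and all(isinstance(x, bool) for x in index):
            if len(index) != len(self):
                raise IndexError("boolean index length must match list length")
            return BoolList(x for x, keep in zip(self, index) if keep)
        return super().__getitem__(index)

a = BoolList([1, 2, 3])
print(a[[True, False, True]])  # [1, 3]
print(a[0])                    # 1 (normal indexing still works)
```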
Benchmark
Using the included patch I wrote short benchmarks:
- “getitem”
# most pythonic version I could think of
for _ in range(1_000_000):
    # data in "columnar format"
    col_1 = list(range(200))
    col_2 = list(range(100, 300))
    col_3_has_some_feature = [True, False] * 100
    # filtering data
    foo = [x for x, y in zip(col_1, col_3_has_some_feature) if y]
    bar = [x for x, y in zip(col_2, col_3_has_some_feature) if y]
running time (ubuntu 20.04, i7-4700hq): 32.80s
# proposed syntax with my patch
for _ in range(1_000_000):
    # data in "columnar format"
    col_1 = list(range(200))
    col_2 = list(range(100, 300))
    col_3_has_some_feature = [True, False] * 100
    # filtering data
    foo = col_1[col_3_has_some_feature]
    bar = col_2[col_3_has_some_feature]
running time (ubuntu 20.04, i7-4700hq): 12.05s
- “setitem”
Note that these two are not exactly equivalent, as my version allocates memory for one more list in the last line (getitem indexing allocates a new list, similar to slicing).
# python without proposed syntax
for _ in range(1_000_000):
    # data in "columnar format"
    col_1 = list(range(200))
    col_2 = list(range(100, 300))
    col_3_has_some_feature = [True, False] * 100
    # rewriting data where feature exists
    for i, x in enumerate(col_3_has_some_feature):
        if x:
            col_1[i] = col_2[i]
running time (ubuntu 20.04, i7-4700hq): 50.14s
# proposed syntax with my patch
for _ in range(1_000_000):
    # data in "columnar format"
    col_1 = list(range(200))
    col_2 = list(range(100, 300))
    col_3_has_some_feature = [True, False] * 100
    # rewriting data where feature exists
    col_1[col_3_has_some_feature] = col_2[col_3_has_some_feature]
running time (ubuntu 20.04, i7-4700hq): 11.92s
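Only the status-quo variants can be timed on stock CPython, but for anyone who wants to check the baseline numbers, a smaller-scale sketch with timeit (absolute times will of course vary by machine):

```python
import timeit

setup = """
col_1 = list(range(200))
col_2 = list(range(100, 300))
col_3_has_some_feature = [True, False] * 100
"""

getitem_stmt = "[x for x, y in zip(col_1, col_3_has_some_feature) if y]"
setitem_stmt = """
for i, x in enumerate(col_3_has_some_feature):
    if x:
        col_1[i] = col_2[i]
"""

# 10_000 iterations instead of 1_000_000 to keep the run short
print("getitem:", timeit.timeit(getitem_stmt, setup, number=10_000))
print("setitem:", timeit.timeit(setitem_stmt, setup, number=10_000))
```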
My code is a proof of concept, so I expect the final version would be slower due to safety checks, yet to me the conciseness and performance benefits look interesting.
If there is interest in this concept I could spend some time on implementation. Let me know what you think.