Syntactic sugar for parent in setitem and getitem to improve boolean indexing

amtrakcipher · March 31, 2022, 9:42pm

Boolean indexing is an extremely common and powerful pattern in Python’s scientific/data science stack. It’s a very expressive, powerful way to manipulate datasets, and it’s present in other languages like Matlab.

One of the most common uses of this syntax is to do boolean indexing with a mask variable which is calculated based on some manipulation of the original variable. For instance, in numpy or pandas, you’ll see a lot of code that looks like this:

dataframe[(dataframe.column_1 > 5) & (dataframe.column_2 < 10)]

This is fine if variable names are short (like df), but results in pretty unwieldy code if the variable names are long. Other languages have syntax that allows for much less repetition when filtering data like this. For instance, R’s filter function lets you omit the name of the parent dataframe, and Scala’s anonymous function syntax expresses the equivalent of a Python lambda function with a single character. Given Python’s prominence in the science and data science worlds, this feels like a major ergonomic disadvantage.

The proposal is to introduce a special character (or very short keyword) that, in the context of index brackets, would be parsed to refer to the object being indexed. For instance, if we chose $ as our self-reference symbol, we could rewrite the above expression as

dataframe[($.column_1 > 5) & ($.column_2 < 10)]

I can attest to having read and written a lot of code that would be easier to read with this feature, and it’s a small unambiguous change that would offer automatic downstream benefits for many major python packages.

CAM-Gerlach · April 1, 2022, 3:51am

As a scientist who makes heavy use of pandas, this would be a huge usability boost for Python’s largest (and most underrepresented) user community, and it addresses one of the biggest things that makes Python more painful than R.

While using a somewhat different approach than the above, PEP 637 was opened that would address this issue by allowing keyword arguments in indexing. Unfortunately, as I understand it, the PEP apparently didn’t make a compelling enough case for the major, immediate practical benefits for scientific computation, didn’t effectively rally enough support from the scientific community and the major packages it would affect, and was sadly rejected. Maybe @steven.daprano , who was the PEP’s sponsor, or @stefanoborini , its author, could explain more.

I don’t disagree about the potential benefits of something to address this case, but as evidenced by the rejection of the much more limited PEP 677, or the brouhaha around assignment expressions or the match statement, introducing a whole new symbol to the syntax of the world’s most popular programming language is anything but a “small unambiguous change”.

gkb · April 1, 2022, 5:45am

Note that in pandas, you can also use the query method to abbreviate your query:

dataframe.query("column_1 > 5 and column_2 < 10")

stefanoborini · April 1, 2022, 7:22am

There’s not much to discuss. I am still in support of the PEP, and I (together with another core developer) developed a working implementation, but the PEP was simply rejected due to lack of perceived benefit.

amtrakcipher · April 1, 2022, 1:59pm

Fair point on not being a small change, but I do think this approach has much less ambiguity than something like the PEP 637 approach (as a result of being much less flexible). There’s no ability for packages to implement or override this behavior; it’s just an alias. Granted it’s hard to choose new symbols that aren’t confusing to some, but I think this is as straightforward as language changes get.

In any case, if there’s just not enough community appetite to solve this problem, it’s a shame.

amtrakcipher · April 1, 2022, 2:02pm

I think this highlights the problem exactly: pandas has written an entire sublanguage for getting items that is passed as a string to get around the ugliness of the Python syntax for getting items. I think it’s pretty obvious why it’d be better to have that logic in Python.

petersuter · April 1, 2022, 4:00pm

PEP 637 mentions:

Pandas currently uses a notation such as:
>>> df[df.x == 1]
which could be replaced with
>>> df[x=1]

I assume the proposal here would instead use:

df[$.x == 1]

The PEP 637 syntax seems to me a better fit for Python than a special operator. But would PEP 637 be able to handle the example given above:

df[($.column_1 > 5) & ($.column_2 < 10)]

Maybe

df[column_1=5:, column_2=:10]

? But that already doesn’t seem so clear and flexible anymore.

I wonder if you could use (already now without changing Python at all) for example a Greek letter like Iota ι instead of a special operator like $, due to Unicode variable support:

ι = df

df[(ι.column_1 > 5) & (ι.column_2 < 10)]

Seems to work. Maybe Pandas could provide a version that works with “the current dataframe”?

CAM-Gerlach · April 1, 2022, 5:29pm

It might be theoretically possible, but probably not at all practical. You’d have to have a special object you’d import from pandas that would emulate the DataFrame API, but instead of performing the operations itself, it would override the various methods (e.g. __getitem__() for [], __getattr__() for ., etc.) to store the requested operations in some sequential structure, and then return an instance of the same mock class to do the same for any subsequent operations. Then, in the parent dataframe’s __getitem__(), it would have to recognize this special object, read the data, construct a query from that, call .query() on it and return that.

There’s also the issue that if you’re not instantiating the special object yourself, either the first call to it would have to do so, or you’d have to make it a singleton, which would be a nightmare given it necessarily has variable state. And what about multiple instances of it in the same object, or calling functions, etc.? There are lots of cases that wouldn’t function with this without even more work. It would be a huge mess…may as well just use a string query() at that point.

And that aside, I don’t see using a Greek letter as much of an improvement, as it would be enough of a pain to type that there wouldn’t be as much net benefit, at least without special setup/IDE support, in which case you can just autocomplete the df name anyway (which is way simpler than making this approach work).

steven.daprano · April 1, 2022, 6:26pm

Regarding PEP 637, I too am very disappointed that it was rejected.

Perhaps if the pandas community would stand up and make themselves heard, they would be less underrepresented and the PEP could be reconsidered in the future. The steering council cannot read people’s minds; if the pandas community wants a feature, they have to lobby for it. Sadly, for PEP 637 there was too much negative feedback from the usual conservative “Don’t touch my Python!” contingent and not enough, or none at all, from people who might actually have used it.

But putting aside my disappointment over PEP 637, I think that it actually wouldn’t help for this specific proposal. I don’t think that there is any nice, obvious mapping between keyword subscripts:

dataframe[spam=value, eggs=value]

and the existing:

dataframe[(dataframe.column_1 > 5) & (dataframe.column_2 < 10)]

But then it’s 5am local time and maybe I’m missing something obvious.

Pandas already has a mini-interpreter for that query language, so perhaps a variant on the query method would be good enough?

dataframe[query="column_1 > 5 and column_2 < 10"]

For what it’s worth, I sympathise with the problem but I hate the proposed use of $ as sugar here.

If all you want is to reduce the length of the name, you can just use an alias:

d = dataframe
d[(d.column_1 > 5) & (d.column_2 < 10)]

and it works! And also misses the point.

The actual point is that pandas has had to develop not one but two entire mini-languages to pass unevaluated expressions: one using strings, and one using whatever type an expression such as dataframe.column_1 > 5 returns.

Think big: what we want is a way to tell the interpreter to delay evaluation of the subscript, so that the dataframe can then evaluate it within its local context rather than the caller’s context.

(I think.)

From a usability perspective, I think what we want to write is this:

dataframe[column_1 > 5 and column_2 < 10]

but currently the subscript expression is evaluated in the caller’s locals, and you likely get a NameError. If you didn’t get a NameError, you get the local (or global) column_1 variable, rather than the column_1 variable from the dataframe namespace.

So I think that, thinking big, real big, what we want is to:

delay the evaluation of the subscript column_1 > 5 and column_2 < 10;
pass that delayed expression into a dataframe method; and
allow the dataframe method to interpret the variables column_1 and column_2 as dataframe attributes rather than local or global variables.

Am I close?

(This may be too big, starting from where Python currently is, but we can ask for the world and work backwards to something actually doable.)

What if we had syntax to delay the evaluation of an expression? For the sake of having some syntax, let’s use backticks:

`column_1 > 5 and column_2 < 10`

returns a function-like object that could be evaluated inside whatever context the method likes:

dataframe[`column_1 > 5 and column_2 < 10`]

calls:

arg = callable delayed expression
dataframe.__getitem__(arg)

which the method can then do:

if isinstance(arg, DelayedExpression):
    result = arg(locals=vars(self))

or something similar.

Evaluating the DelayedExpression would by default occur in the environment where it is evaluated, not the caller’s environment.
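As a rough illustration only: the backtick syntax does not exist, but the idea can be approximated today with a string compiled up front. In the sketch below, DelayedExpression and Record are hypothetical names invented for the example. Record is a toy stand-in for a dataframe row, not a pandas type.

```python
class DelayedExpression:
    """Hypothetical stand-in for the backtick syntax: holds an unevaluated expression."""

    def __init__(self, source):
        self.source = source
        # Compile in "eval" mode; syntax errors surface here, but nothing is evaluated yet.
        self._code = compile(source, "<delayed>", "eval")

    def __call__(self, locals=None):
        # Evaluate in the namespace chosen by the receiver, not the caller's namespace.
        return eval(self._code, {"__builtins__": {}}, dict(locals or {}))


class Record:
    """Toy container whose attributes play the role of dataframe columns."""

    def __init__(self, **columns):
        self.__dict__.update(columns)

    def __getitem__(self, arg):
        if isinstance(arg, DelayedExpression):
            return arg(locals=vars(self))
        raise TypeError("this sketch only supports DelayedExpression subscripts")


row = Record(column_1=7, column_2=3)
result = row[DelayedExpression("column_1 > 5 and column_2 < 10")]
print(result)  # True
```

Here column_1 and column_2 are resolved against the Record’s attributes rather than the caller’s locals or globals, which is the behaviour the three steps above ask for.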

Would that make sense?

petersuter · April 1, 2022, 7:12pm

Instead of storing the information in the singleton, it should return a new object. The rest sounds relatively simple (compared to introducing a special new operator) and is probably close to what Pandas already does, no?

Personally I agree. I would rather type df. than ι. Data science users might have a different taste, see Julia. A new non-ASCII special operator would have the same downside.

It’s probably possible with macropy.

amtrakcipher · April 1, 2022, 7:36pm

Yes, in practice I’ve found that this aliasing is what most people do, and it leads to the bad practice of giving every dataset or array an uninformative one or two character name. It’s a bad habit a lot of people pick up if they enter Python through these ecosystems.

I think this is a very nice description for pandas, where data is organized hierarchically. I’ve also found this pattern present in numpy code, though, and I don’t think the implied solution is the same. Consider

some_matrix_variable[(some_matrix_variable < 0) | np.isnan(some_matrix_variable)] = default_value

Sure, numpy could have its own delayed evaluation language with a self reference symbol, but it’s up to the community to adopt coherent conventions (or not). (To be fair, they’ve mostly done this for boolean indexing).

More broadly, my idea is influenced by exposure to Scala’s anonymous function syntax, which is (in my personal estimation) the nicest syntax for expressing data transformations I’ve encountered. (What would be written as filter(lambda x: x>0, map(lambda y: y+1, data)) in Python is data.map(_ + 1).filter(_ > 0) in Scala–comprehensions are the better way to do this in Python, but the scientific stack relies on functional approaches for speed).
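For concreteness, here is a small sketch of the two Python spellings mentioned above, using a made-up list called data. It shows only that the functional form and the comprehension give the same result.

```python
data = [-3, -1, 0, 2]

# Functional style: add 1 to every element, then keep the positive results.
functional = list(filter(lambda x: x > 0, map(lambda y: y + 1, data)))

# Equivalent comprehension.
comprehension = [y + 1 for y in data if y + 1 > 0]

print(functional)     # [1, 3]
print(comprehension)  # [1, 3]
```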

If you want a bigger proposal, how about expanding $ to be a shortcut for one variable lambdas!

nas · April 1, 2022, 8:48pm

It might be useful to look at what Polars does for lazy queries. As Steven observes, you really want a way of delaying evaluation and passing the expression to Pandas. Then it can do optimizations like avoiding intermediate copies of the dataframe. R does this using non-standard evaluation. Python doesn’t have such a thing.

brettcannon · April 1, 2022, 9:19pm

The SC actually put in the effort to reach out to people on the pandas team to make sure they were taken into consideration on PEP 637. In the end, pretty much only xarray publicly stated they supported the PEP.

And I will say that one of the first things the SC would do if this proposal became a PEP is ask for testimony from the key science projects if they would find the new feature useful.

stefanoborini · April 2, 2022, 11:48am

As someone else has said, this is a case of non-standard evaluation. Python does not do such a thing. R does, and R does NSE all the time. Nothing you pass to a function is evaluated there and then; it’s passed down and evaluated when it is needed. I am sure some of you are already familiar with the concept and with the implications of this technique from the software design point of view.

However, from my point of view as a developer who has had to develop in R after many, many Python years, I find it really problematic. Having evaluation happen much later makes it really hard to pinpoint the actual source of an error (even a basic typo), because the error will be raised where the expression is evaluated, not where it’s written.
Additionally, the syntax never makes it clear what the scope of the expression will be. Only the called function will decide. It may be the caller, it may be the callee, it may be anything else. This is extremely confusing and opaque.

R goes even further, and allows for complete reinterpretation of the passed expression. In python if you call

foo(a>5)

foo will receive whatever the evaluation of a>5 is. In R, it will receive the expression “a>5” and can choose to evaluate it, or alter it, or parse it in its own unique way. You will never know what you are getting when you are calling a routine in R. And of course this means that in R the above line cannot always be refactored into:

x = a > 5
foo(x)

and you have no indication whatsoever about which routines allow this, and which ones will crash and burn.

So, the idea of using backticks seems appealing. At least you make it explicit that whatever it contains won’t be evaluated there and then, but instead passed as an “expression object” for later evaluation. But is it really worth it? It’s basically a shortcut for lambda, and you still don’t have the flexibility of saying in which scope that expression will be evaluated.

CAM-Gerlach · April 2, 2022, 4:49pm

As an alternate approach that wouldn’t require new syntax or other language/CPython-level changes, what about pandas simply treating string slices that don’t match a column name as expressions to be passed to query?

guido · April 2, 2022, 5:38pm

I’d beware of too much overloading. Since there’s already a method to do that, it’s probably unnecessary to also provide an overload.

Anyway, we are not the experts on how to design the pandas API. We should leave that to the pandas team. We might be able to design a language feature in collaboration with them, but for that we’d have to explicitly seek that collaboration. In the past this has led to the matmul operator (@).

Melendowski · April 3, 2022, 10:20am

My experience with the pandas issue board is that they:

1.) Have an aversion to changes to the current API (understandable, as it’s a hugely popular project with a lot of users)
2.) Have an aversion to new API (they admit their API is huge and confusing; off the top of my head there’s [] .loc .iloc .at .iat .query .xs for indexing). They don’t want it bigger

As far as delayed evaluation is concerned, the [] .loc .iloc indexers I listed in point 2 are overloaded to accept callables:

.loc

A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above)
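A minimal sketch of what that looks like in practice, assuming pandas is installed. The column values are made up for illustration, and the callable’s parameter name d is arbitrary.

```python
import pandas as pd

dataframe = pd.DataFrame({"column_1": [3, 6, 9], "column_2": [1, 12, 4]})

# The callable receives the calling DataFrame, so the long name is not repeated.
filtered = dataframe[lambda d: (d.column_1 > 5) & (d.column_2 < 10)]
same = dataframe.loc[lambda d: (d.column_1 > 5) & (d.column_2 < 10)]

print(filtered.index.tolist())  # [2]
```

Only the last row satisfies both conditions, so both spellings select the row with index 2.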

I imagine points 1 and 2, as well as the existing methods, are why they didn’t care too much for the PEP handling keywords in __getitem__.

With that being said, an object that overloads the arithmetic and comparison operators and returns closures is something people have played with (and it doesn’t interfere with the pandas API). I also think a more general approach to this has been discussed on one of the Python mailing lists. I think it was in a thread about late binding of function arguments that somehow devolved into talking about it.
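To make the closure idea concrete, here is a minimal sketch. Deferred and X are hypothetical names for this example, only a handful of operators are covered, and it is not taken from any existing package.

```python
import operator
from types import SimpleNamespace


class Deferred:
    """Hypothetical placeholder wrapping a one-argument function of the parent object."""

    def __init__(self, func=lambda parent: parent):
        self._func = func

    def __call__(self, parent):
        return self._func(parent)

    def __getattr__(self, name):
        # Attribute access is deferred until the parent object is supplied.
        return Deferred(lambda parent: getattr(self._func(parent), name))

    def _binary(op):
        def method(self, other):
            def closure(parent):
                right = other(parent) if isinstance(other, Deferred) else other
                return op(self._func(parent), right)
            return Deferred(closure)
        return method

    __gt__ = _binary(operator.gt)
    __lt__ = _binary(operator.lt)
    __add__ = _binary(operator.add)
    __and__ = _binary(operator.and_)
    __or__ = _binary(operator.or_)
    del _binary  # helper is only needed while the class body runs


X = Deferred()
row = SimpleNamespace(column_1=7, column_2=3)

predicate = (X.column_1 > 5) & (X.column_2 < 10)
print(predicate(row))  # True
```

Because the result is an ordinary one-argument callable, it could in principle be handed to any indexer that accepts callables, though that is not tested here.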

cben · April 3, 2022, 4:26pm

I know very little about pandas, but I just wanted to say that the pros and cons of any “query language” are incomplete if you don’t consider:

Is the query easy to introspect/optimize/compile to CPU, GPU and/or parallel compute?

Numba/CuPy/cuDF exist, and resort to pretty crazy stuff like python bytecode introspection !
In principle, a query-string notation — despite requiring a parser — is much saner for such work
A syntactic transform foo[$.bar < 5] ⟶ foo[foo.bar < 5] doesn’t help.
A similar-looking global with attr/operator overloading could help by building up AST-like representation of the query: foo[DF.bar < 5] ⟶ foo[LessThan(Column('bar'), 5)]. I don’t know whether the already mentioned python-paddles works like that under the hood, just pointing out that kind of API could do it.

But I can’t comment on specific situation & goals for SciPy ecosystem.
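A minimal sketch of the overloaded-global idea, to show how operators can build a tree instead of computing a result. The names Node, Column, Binary, LessThan, GreaterThan, And, DF and to_query are all assumptions made for this example and are not any library’s API.

```python
from dataclasses import dataclass


class Node:
    """Base class: operators build tree nodes instead of computing values."""

    def __lt__(self, other):
        return LessThan(self, other)

    def __gt__(self, other):
        return GreaterThan(self, other)

    def __and__(self, other):
        return And(self, other)


@dataclass(frozen=True)
class Column(Node):
    name: str


@dataclass(frozen=True)
class Binary(Node):
    left: object
    right: object


class LessThan(Binary):
    symbol = "<"


class GreaterThan(Binary):
    symbol = ">"


class And(Binary):
    symbol = "and"


class _Columns:
    def __getattr__(self, name):
        # DF.bar yields a Column node rather than any actual data.
        return Column(name)


DF = _Columns()


def to_query(node):
    """Walk the tree and render a query-string-like form."""
    if isinstance(node, Column):
        return node.name
    if isinstance(node, Binary):
        return f"({to_query(node.left)} {node.symbol} {to_query(node.right)})"
    return repr(node)  # plain Python literal


tree = (DF.column_1 > 5) & (DF.column_2 < 10)
print(to_query(tree))  # ((column_1 > 5) and (column_2 < 10))
```

The tree can be inspected or translated by whoever receives it, and sub-expressions such as DF.column_1 > 5 are ordinary values that can be stored and combined later.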

Is it easy to build/perform partial queries and compose them?

This is where string notations clearly lose (cf. any SQL query builder).

Indexing by boolean mask / callables lets you extract variables or functions:

condition_1 = dataframe.column_1 > 5
return dataframe[condition_1 & get_condition_2(dataframe)]

So does the overloaded-object approach.

hlovatt · April 6, 2022, 8:30pm

I think short lambda syntax along the lines @amtrakcipher suggested is the way to go, because it has much wider applicability. Also it satisfies the criteria of @cben to allow execution on GPUs.

The original example would actually be unchanged if $ could be used as the marker for a lambda:

dataframe[$.column_1 > 5]

Would be the same as:

dataframe[lambda x: x.column_1 > 5]

I’ve not studied the grammar to know if above is possible!

guido · April 6, 2022, 9:50pm

That syntax is possible, no need to worry about that.

But wasn’t one of the requirements that the expression can be introspected? Introspecting a lambda isn’t straightforward, and in general may not be possible (it would require disassembling the bytecode). If you have a line number it could be possible to re-read the source file and take that, assuming the file wasn’t edited in the meantime.
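A small sketch of what bytecode-level introspection gives you, using the standard dis module. Exact opcode names vary between CPython versions, so the example looks only at details that should be stable.

```python
import dis

predicate = lambda x: x.column_1 > 5

# A compiled lambda keeps bytecode and name tables, not the original expression tree.
opnames = [ins.opname for ins in dis.get_instructions(predicate)]

print(predicate.__code__.co_names)  # ('column_1',)
print("COMPARE_OP" in opnames)      # the comparison is visible only as an opcode
```

Recovering “column_1 > 5” as a structured expression from this means reconstructing it from the instruction stream, which is the disassembly work described above.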
