Regarding PEP 637, I too am very disappointed that it was rejected.
Perhaps if the pandas community would stand up and make themselves heard, they would be better represented and the PEP could be reconsidered in the future. The Steering Council cannot read people’s minds. If the pandas community wants a feature, they have to lobby for it, and sadly for PEP 637 there was too much negative feedback from the usual conservative “Don’t touch my Python!” contingent and not enough, or zero, from people who might actually have used it.
But putting aside my disappointment over PEP 637, I think that it actually wouldn’t help for this specific proposal. I don’t think that there is any nice, obvious mapping between keyword subscripts:
dataframe[spam=value, eggs=value]
and the existing:
dataframe[(dataframe.column_1 > 5) & (dataframe.column_2 < 10)]
But then it’s 5am local time and maybe I’m missing something obvious.
Pandas already has a mini-interpreter for that query language, so perhaps a variant on the query method would be good enough?
dataframe[query="column_1 > 5 and column_2 < 10"]
For what it’s worth, I sympathise with the problem but I hate the proposed use of $ as sugar here.
If all you want is to reduce the length of the name, you can just use an alias:
d = dataframe
d[(d.column_1 > 5) & (d.column_2 < 10)]
and it works! And also misses the point.
The actual point is that pandas has had to develop not one but two entire mini-languages to pass unevaluated expressions: one using strings, and one using whatever type expressions like dataframe.column_1 > 5 return.
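To make the second of those mini-languages concrete, here is a rough sketch of how overloaded operators can turn a comparison into an elementwise mask object. Column and Mask are hypothetical stand-ins I made up for illustration, not pandas's actual classes, and real pandas does a great deal more.

```python
from types import SimpleNamespace


class Mask:
    """Hypothetical elementwise boolean result (a stand-in, not a pandas type)."""

    def __init__(self, flags):
        self.flags = list(flags)

    def __and__(self, other):
        # `&` can be overloaded; `and` cannot, which is why the
        # existing spelling needs `&` plus extra parentheses.
        return Mask(a and b for a, b in zip(self.flags, other.flags))


class Column:
    """Hypothetical column whose comparisons return a Mask, not a bool."""

    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, threshold):
        return Mask(v > threshold for v in self.values)

    def __lt__(self, threshold):
        return Mask(v < threshold for v in self.values)


# `d` plays the role of the aliased dataframe from the example above.
d = SimpleNamespace(column_1=Column([3, 7, 9]), column_2=Column([4, 12, 8]))
mask = (d.column_1 > 5) & (d.column_2 < 10)
print(mask.flags)  # [False, False, True]
```

Note that everything here is still evaluated eagerly in the caller's scope; the operators just return richer objects than plain bools.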
Think big: what we want is a way to tell the interpreter to delay evaluation of the subscript, so that the dataframe can then evaluate it within its local context rather than the caller’s context.
(I think.)
From a usability perspective, I think what we want to write is this:
dataframe[column_1 > 5 and column_2 < 10]
but currently the subscript expression is evaluated in the caller’s locals, and you likely get a NameError. If you didn’t get a NameError, you get the local (or global) column_1 variable, rather than the column_1 variable from the dataframe namespace.
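A small demonstration of both failure modes, using a hypothetical Frame class that does nothing but record what reaches its __getitem__:

```python
class Frame:
    """Hypothetical stand-in for a dataframe; it only records __getitem__ calls."""

    def __init__(self):
        self.calls = []

    def __getitem__(self, key):
        self.calls.append(key)
        return key


frame = Frame()
error = None
try:
    # The names are looked up in the caller's scope before
    # __getitem__ is ever called.
    frame[column_1 > 5 and column_2 < 10]
except NameError as exc:
    error = exc

print(type(error).__name__)  # NameError
print(frame.calls)           # []

# If the caller happens to define those names, they are used silently,
# and the frame only ever sees the already-computed result.
column_1, column_2 = 7, 3
frame[column_1 > 5 and column_2 < 10]
print(frame.calls)           # [True]
```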
So I think that, thinking big, real big, what we want is to:
- delay the evaluation of the subscript column_1 > 5 and column_2 < 10;
- pass that delayed expression into a dataframe method; and
- allow the dataframe method to interpret the variables column_1 and column_2 as dataframe attributes rather than local or global variables.
Am I close?
(This may be too big, starting from where Python currently is, but we can ask for the world and work backwards to something actually doable.)
What if we had syntax to delay the evaluation of an expression? For the sake of having some syntax, let’s use backticks:
`column_1 > 5 and column_2 < 10`
returns a function-like object that could be evaluated inside whatever context the method likes:
dataframe[`column_1 > 5 and column_2 < 10`]
calls:
arg = ...  # the callable delayed expression object
dataframe.__getitem__(arg)
and the method can then do:
if isinstance(arg, DelayedExpression):
    result = arg(locals=vars(self))
or something similar.
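Since the backtick syntax doesn't exist, the closest I can get today is to emulate it with a string plus compile() and eval(). The sketch below is only that: DelayedExpression and RowFrame are hypothetical names, the expression is passed as source text rather than real syntax, and I evaluate it once per row (an assumption of mine, so that plain `and` works on scalar values) rather than against vars(self).

```python
class DelayedExpression:
    """Hypothetical stand-in for the proposed backtick syntax.

    It holds source text, compiles it once, and defers evaluation
    until it is called with a namespace.
    """

    def __init__(self, source):
        self.source = source
        self._code = compile(source, "<delayed>", "eval")

    def __call__(self, locals=None):
        # Names are resolved in the namespace supplied by whoever
        # calls us, not in the scope where the expression was written.
        return eval(self._code, {}, dict(locals or {}))


class RowFrame:
    """Hypothetical row-oriented frame: each row is a dict of column values."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __getitem__(self, arg):
        if isinstance(arg, DelayedExpression):
            # Evaluate the expression with each row as its local namespace.
            return [row for row in self.rows if arg(locals=row)]
        raise TypeError("this sketch only handles DelayedExpression")


rows_frame = RowFrame([
    {"column_1": 3, "column_2": 4},
    {"column_1": 7, "column_2": 12},
    {"column_1": 9, "column_2": 8},
])
expr = DelayedExpression("column_1 > 5 and column_2 < 10")
print(rows_frame[expr])  # [{'column_1': 9, 'column_2': 8}]
```

This is really just the string mini-language again in disguise, which is rather the point: real syntax would let the interpreter do the compile step and hand over the code object directly.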
By default, a DelayedExpression would be evaluated in the environment where it is eventually called, not in the environment of the code that wrote the expression.
Would that make sense?