Possibilities for improving pipelining syntax with the new PEG parser

A lot of code in data-science applications built on pandas/pyspark/dask/scikit-learn etc. makes heavy use of method chaining for pipelining. This is natural in many cases, and necessary in the case of pyspark (where the query plan is only created at the end, and its shape may substantially affect performance).
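
For readers who haven’t seen the style, here is a minimal, made-up pandas sketch of such a chain (the column names and steps are purely illustrative):

import pandas as pd

df = pd.DataFrame({"case_id": [1, 1, 2], "some_value": [10, 0, 7]})

# Each step returns a new DataFrame, so the transformation reads top to bottom.
summary = (
    df
    .query("some_value > 0")
    .assign(doubled=lambda d: d.some_value * 2)
    .groupby("case_id", as_index=False)
    .agg(total=("doubled", "sum"))
)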

I think the currently available syntax options for pipelining in Python are suboptimal, and I’ve been following the inclusion of the new pegen parser with great interest, not least because I believe it allows (at least in principle) improving this situation.

I’m going to use a generic pyspark example to illustrate the current options, as well as two separate but related improvements I’d like to propose:

  1. Allowing line-continuation by indentation, without a \-marker
  2. Amending PEP8 to allow “dot-alignment”

Please note that all my comments about parsers and their capabilities & limitations are on a best-effort basis, and I make no claim to speak authoritatively.

Prelims
from pyspark.sql import Window
import pyspark.sql.functions as F

some_actual_value = 1000

Note that importing pyspark.sql.functions under an alias (rather than star-importing its contents) is important, because the star import would shadow builtins like max & min.

IIUC, before Python 3.9 the syntax was limited by the old LL(1) parser, which didn’t allow looking ahead by more than one token. This means that a line continuation outside brackets needs to be indicated by a special token (\), and crucially that no comments are possible on the continued lines.
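
As a minimal illustration of that restriction (the failing variants are shown commented out so the snippet still runs):

x = (1
     # inside brackets, a comment between continued lines is fine
     + 2)

# With backslash continuation there is nowhere to put a comment; both of the
# following are SyntaxErrors if uncommented:
# y = 1 + \  # a comment after the backslash breaks the continuation
#     2
# y = 1 + \
#     # a comment as the continued line also fails
#     2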

Basic pipelining with `\`; no comments possible
last_status = cases \
    .withColumn("rank", F.row_number.over(
        Window.partitionBy("case_id").orderBy(F.desc("some_timestamp")))) \
    .filter(F.col("rank") == 1) \
    .select(
        F.col("case_id"),
        F.col("is_archived").alias("is_archived_latest"),
        F.when((F.col("is_archived") == F.lit(True))
               | (F.col("some_value") <= F.lit(some_actual_value)),
               # if archived or [whatever], set some_value to zero
               F.lit(0))
        .otherwise(F.col("some_value"))
        .alias("some_value"))

This limitation is IMO unacceptable in most cases (especially for high-complexity collaborative code), because documenting the pipelined code is essential.

One way to enable comments nevertheless is to give the parser “guardrails” in the form of brackets. Due to the PEP8 indentation rules, the continued lines must then either be aligned with the opening bracket or move to a new line with extra indentation. While it’s not the end of the world, it becomes cumbersome (IMHO) to wrap things in brackets as soon as you need a comment, quite apart from the higher density of brackets, which makes digesting the code even harder.

Using brackets; no extra newline
last_status = (cases
               # per case: sort by some_timestamp
               .withColumn("rank", F.row_number.over(
                   Window.partitionBy("case_id").orderBy(F.desc("some_timestamp"))))
               # restrict to last date
               .filter(F.col("rank") == 1)
               .select(
                   F.col("case_id"),
                   F.col("is_archived").alias("is_archived_latest"),
                   F.when((F.col("is_archived") == F.lit(True))
                          | (F.col("some_value") <= F.lit(some_actual_value)),
                          # if archived or [whatever], set some_value to zero
                          F.lit(0))
                   .otherwise(F.col("some_value"))
                   .alias("some_value")
               ))
Using brackets; with extra newline
last_status = (
    cases
    # per case: sort by some_timestamp
    .withColumn("rank", F.row_number.over(
        Window.partitionBy("case_id").orderBy(F.desc("some_timestamp"))))
    # restrict to last date
    .filter(F.col("rank") == 1)
    .select(
        F.col("case_id"),
        F.col("is_archived").alias("is_archived_latest"),
        F.when((F.col("is_archived") == F.lit(True))
               | (F.col("some_value") <= F.lit(some_actual_value)),
               # if archived or [whatever], set some_value to zero
               F.lit(0))
        .otherwise(F.col("some_value"))
        .alias("some_value"))
)
Using brackets; black'd
last_status = (
    cases
    # per case: sort by some_timestamp
    .withColumn(
        "rank",
        F.row_number().over(
            Window.partitionBy("case_id").orderBy(F.desc("some_timestamp"))
        ),
    )
    # restrict to last date
    .filter(F.col("rank") == 1).select(
        F.col("case_id"),
        F.col("is_archived").alias("is_archived_latest"),
        F.when(
            (F.col("is_archived") == F.lit(True))
            | (F.col("some_value") <= F.lit(some_other_value)),
            # if archived or [whatever], set some_value to zero
            F.lit(0),
        )
        .otherwise(F.col("some_value"))
        .alias("some_value"),
    )
)

Personally, I believe it would improve the daily lives of many people who work with these (extremely widespread) libraries if it were possible to write something like the following:

last_status = cases
    # per case: sort by some_timestamp
    .withColumn("rank",
                F.row_number.over(Window.partitionBy("case_id")
                                        .orderBy(F.desc("some_timestamp"))))
    # restrict to last date
    .filter(F.col("rank") == 1)
    .select(
        F.col("case_id"),
        F.col("is_archived").alias("is_archived_latest")
        F.when((F.col("is_archived") == F.lit(True))
               | (F.col("some_value") <= F.lit(some_actual_value)),
               # if archived or [whatever], set some_value to zero
               F.lit(0))
         .otherwise(F.col("some_value"))
         .alias("some_value"))

I believe this is clearer to read, reduces the bracket/symbol/indentation density, and closes a gap where other languages in this space currently do better than Python (e.g. R or Scala; the magrittr pipe does not need extra brackets).

AFAIU, this would need the stronger lookahead capabilities of the pegen parser (well beyond LL(1)), because obviously there could be arbitrarily many lines of comments in between the actual code lines.

The keen-eyed will notice that the above example already includes the dot-alignment suggestion (point 2 above) as well, which would also enable the following alternative:

Alternative without newline; using dot-alignment
last_status = cases.withColumn("rank",
                               # per case: sort by some_timestamp
                               F.row_number()
                                .over(Window.partitionBy("case_id")
                                            .orderBy(F.desc("some_timestamp"))))
                   # restrict to last date
                   .filter(F.col("rank") == 1)
                   .select(
                       F.col("case_id"),
                       F.col("is_archived").alias("is_archived_latest")
                       F.when((F.col("is_archived") == F.lit(True))
                              | (F.col("some_value") <= F.lit(some_actual_value)),
                              # if archived or [whatever], set some_value to zero
                              F.lit(0))
                        .otherwise(F.col("some_value"))
                        .alias("some_value"))

Personally, I’d like to see both become possible, but for me the much bigger win would be 1., whereas 2. is an (even more) cosmetic improvement.

2 Likes

Some quick comments here from someone who has never used any of these packages:

  1. This really looks like it could use a dedicated query language instead of making do with magic attributes and methods.

  2. Is typing the extra pair of parentheses really such a hassle that you want to change the language?

  3. Can you think of a PEG grammar that would make this possible? Start with python.gram

1 Like

Thanks a lot for taking a look! :slight_smile:

Responding to your points:

  1. I can’t claim to speak for what the whole ecosystem wants, but creating a separate language, with separate syntax and rules etc., is probably going to face an uphill battle, because things already work quite well. See e.g. how this “modern pandas” post from 2016 recommends method chaining as best practice (and it is used a lot in exactly that way).

  2. I get that it’s not a make-or-break situation. But those brackets come with extra indentation requirements (assuming one uses a code linter), and are a constant paper cut when developing and maintaining such code bases, for the reasons I outlined in the OP.

  3. I apologise that my understanding of PEG and the current grammar is rudimentary. But AFAIU, it should be possible (in a limited manner) with something very roughly like the following:

@@ -614,6 +614,9 @@ t_primary[expr_ty]:
                  (b) ? ((expr_ty) b)->v.Call.keywords : NULL,
                  EXTRA) }
     | a=atom &t_lookahead { a }
+    | a=atom b=t_dot_continue+ { _PyPegen_concatenate_accessors(p, a, b) }
+t_dot_continue[expr_ty]:
+    | NEWLINE INDENT [TYPE_COMMENT NEWLINE INDENT]* &'.' a=atom { a }
 t_lookahead: '(' | '[' | '.'
 t_atom[expr_ty]:
     | a=NAME { _PyPegen_set_expr_context(p, a, Store) }

Re (1) and (2), I wonder if the true culprit isn’t your formatter/linter. The code examples shown in that “modern pandas” post look eminently reasonable with how the code is laid out, making a minimal amount of fuss about the extra parentheses. I know Black disagrees, but that’s the aspect of Black I dislike the most myself, and maybe it’s time to part ways with Black, rather than trying to get the language changed (more about that below).

Re (3), I think that’s not quite enough (you picked a corner of the grammar that’s only used for assignment targets, i.e. what’s to the left of =), but there’s probably something similar possible in the primary target. However, I just tried a quick prototype of just some grammar modifications, and it caused the REPL to wait for an extra blank line after a simple assignment. So there’s more work if we really wanted to go this way.

But.

Changing the language is a huge effort – it’s not a matter of just figuring out how to change the parser. You have to get all the tools to understand the new format, including auto-indenting in dozens of popular editors, everybody’s favorite static checker, and so on. And then you have to update the documentation – not just docs.python.org, but also popular books and websites.

And then you have to wait years before you can finally use it, because it takes a long time before people are willing to say “this code no longer supports Python 3.9”.

Compared to that, forking Black and changing how it handles parenthesized multi-line expressions seems child’s play.

1 Like

I’m not worried about black TBH, but about something like flake8, which is pretty standard (for those who find black too invasive). But let’s assume for the moment that it would be a real and measurable (if small) improvement to the language experience - right now I’m the only one here, but many people I work with have expressed interest in something like this.

I 100% understand that language changes are a lot of effort, but since this is not changing existing syntax, merely adding another way that’s currently not possible, I believe the effort is tractable and, ultimately, worth the cost.

Also, every journey starts with a first step, and since I see myself using Python for a long time to come, I don’t mind waiting (plus, this is not aimed so much at library authors - who need to support many Python versions - but could be used by those who want it as soon as it appears in a Python version, so there’s no need to wait until 3.9 is unsupported).

Under the assumption that the grammar issues can be overcome, what would be some next steps?

  • Gather more feedback
  • Try to get a POC working
  • Write a PEP

WDYT?

1 Like

I have done everything I can to discourage you. It’s now up to you to see if you can get others to support you. Getting someone to help with a working PoC would be the next step to ensure you’re not fooling yourself into thinking this is easy.

To add my support to what @guido is saying, I don’t think this is worth a language change. If we had a time machine (or Guido lent us the keys to his :slightly_smiling_face:) then maybe with hindsight this is something that could have been in the language from the start, but the transitional costs are simply not worth it from where we are now IMO.

As @guido noted, it will be a long time before code can widely say “Python 3.10+ only”, and by the time that happens, it’s quite possible that the “modern Pandas” style will no longer be the preferred idiom and something else will have superseded it.

On the other point you mentioned, code formatters and linters, like black and flake8, should help people code, not get in the way. If a tool is blocking you from using a perfectly legal Python syntax that expresses your intent clearly (like the “modern Pandas” style does) then surely it’s the tool that needs fixing, and not the language? After all, if you did get the language changed, the tools would have to be updated to allow the new syntax - so getting them changed to match a currently legal idiom should be less work, not more. (Either that, or there’s a deeper problem somewhere, that isn’t really anything to do with language changes or code styles…)

If we were to change the language, I would prefer to make it possible to express the above example (or a large part of it) in a comprehension-like style:

last_status = select(
    (case_id,
     is_archived as is_archived_latest,
     0 if (is_archived == True or some_value <= some_actual_value)
     else some_value
    )
    for rank, case_id, is_archived, some_value in cases
    partition by case_id
    order by some_timestamp
    if rank == 1
)

But that is just a fantasy, of course.

Thanks for your input!

Even though this is getting into semantic details, I’d say this isn’t a language change so much as a language addition. As such, the change can take as much time as it needs to percolate through the ecosystem (linters, IDEs, etc.), without anyone paying an undue price.

And as I mentioned above, application code (as opposed to library code) doesn’t have to care about being able to widely say “Python 3.10+ only”; it only needs Python 3.10 to be available, plus enough arguments to upgrade the Python version in use.

I’ll think some more about this, but I think the core idea is sound. Maybe I’ll give it a shot myself at trying to implement it eventually.

1 Like

I don’t think it’s a reasonable goal to recreate another SQL-like query language. Part of the reason the current method chaining works so well for pandas/pyspark etc. is that it solves some big deficiencies of SQL:

  • it’s much easier to iterate with atomic increments (e.g. in a REPL or notebook; see the sketch after this list)
  • it’s much easier to write code that keeps referring to itself or intermediate quantities
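
To make the first point concrete, here is the kind of step-by-step workflow meant by “atomic increments”, again with a made-up pandas frame; once each step looks right, the intermediate names get folded back into a single chain:

import pandas as pd

df = pd.DataFrame({"case_id": [1, 1, 2], "some_value": [10, 0, 7]})

step1 = df.query("some_value > 0")                        # inspect before moving on
step2 = step1.assign(doubled=lambda d: d.some_value * 2)  # inspect again
step3 = step2.groupby("case_id", as_index=False).agg(total=("doubled", "sum"))
print(step3)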

Furthermore, I’m not sure that all transformations that make sense from DataFrame to DataFrame can actually be expressed from the POV of such a query language - it may be possible, but the pandas API is huge, and I just chose a simple example.

2 Likes

You originally mentioned updating PEP8 to allow dot-alignment. PEP8 is a (proposal for a) style guide for standard-library code, and third-party code does not have to follow PEP8 (for example, check out Google’s style guide).

Furthermore, from my experience, Python doesn’t feel like a language in which you would want much method chaining (what you call pipelining*). I think you could do the same thing functionally (either passing the result of one function into the next, or building up a function by passing functions into functions (lambda)), or by calling methods on a TableFilter instance. The latter feels the most explicit about what you’re doing (and explicit is Zen); however, Python is great in allowing all of these approaches for different communities that are used to different programming styles.
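
For what it’s worth, a rough sketch of the “functional” alternative, using a hypothetical pipe helper (not a standard-library function):

from functools import reduce

def pipe(value, *steps):
    # Thread a value through a sequence of one-argument callables.
    return reduce(lambda acc, step: step(acc), steps, value)

print(pipe(3, lambda x: x + 1, lambda x: x * 10))  # 40

# With the pyspark example from above this would read roughly like:
# last_status = pipe(
#     cases,
#     lambda df: df.filter(F.col("rank") == 1),
#     lambda df: df.select("case_id", "is_archived"),
# )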

As for the challenges: is there any way that the suggested changes to the syntax would break any existing code?

*For me, pipelining is the idea of running tasks in overlapping stages rather than strictly one after another, e.g. when you need to process a bunch of files, you can load one file, then process that file while loading the next one, then save the first file while processing the second and loading the third, etc.
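
A tiny sketch of that sense of pipelining, with hypothetical load/process steps, overlapping the loading of the next file with the processing of the current one:

from concurrent.futures import ThreadPoolExecutor

def load(path):        # hypothetical I/O-bound step
    return f"data from {path}"

def process(data):     # hypothetical compute step
    print("processing", data)

paths = ["a.csv", "b.csv", "c.csv"]
with ThreadPoolExecutor(max_workers=1) as io_pool:
    pending = io_pool.submit(load, paths[0])
    for nxt in paths[1:]:
        data = pending.result()              # wait for the current file
        pending = io_pool.submit(load, nxt)  # start loading the next one...
        process(data)                        # ...while processing this one
    process(pending.result())                # last file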

1 Like

No, that’s not the problem. (In general, if a proposal breaks existing code, it’s dead in the water, and I wouldn’t have had to respond with actual arguments. :slight_smile:)

Which reminds me, the OP wrote that this “is not changing existing syntax, merely adding another way that’s currently not possible”.

Again, that’s the wrong way to think about changes to Python. Proposals that change existing syntax are in general unacceptable. All proposals are required to maintain backwards compatibility, which means adding new syntax is the only possible change.

1 Like

What about interactive REPL usage?
The built-in REPL reads lines one by one and uses essentially two heuristics to know when a statement/block is complete and ready to run:

  1. if a physical line can parse as a complete logical line, it is;
  2. a blank line closes all indented blocks

(1) is motivated by making simple use cases Just Work:

>>> print("Hello")Enter
Hello
>>> 2+2Enter
4

I believe (1) has so far informed the design of the grammar?
There is currently no situation where a line looks like a complete statement but you need lookahead to figure out that it isn’t.
IMHO it’s also a nice property for humans reading the code.
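
For reference, heuristic (1) is essentially what the stdlib exposes as codeop.compile_command (which code.InteractiveConsole builds on): it returns a code object when the input is complete, and None when more lines are needed.

import codeop

print(codeop.compile_command("2 + 2"))    # code object: complete, execute it
print(codeop.compile_command("if 1:"))    # None: clearly incomplete, keep reading
# The awkward case for the proposal: this already compiles as a complete
# statement, yet a following ".filter(...)" line would have to extend it.
print(codeop.compile_command("last_status = cases"))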

(2) is actually a deviation from the grammar of .py files! It was not an obstacle to making the full grammar more permissive (blank lines are allowed in functions and other blocks, by looking ahead to the next non-blank, non-comment line, or to EOF).

I realize people who write code like this are likely to use a REPL with multi-line editing support like ipython or Jupyter.
Still, the proposal needs to say how the builtin REPL will work.
What happens when you type:

>>> last_status = cases   [Enter]

Is it immediately executed? (If so, the new style without parentheses won’t work in the REPL.)
Do you get a continuation prompt:

...    # per case: sort by some_timestamp   [Enter]
...    .withColumn("rank", ...)   [Enter]
...   [Enter]

But in the latter case, do you need [Enter][Enter] after every single statement?

Personally, I’d be happy for the answer to be “add separate keys for execution vs. continuation to the builtin REPL” and/or “add real multi-line editing to the builtin REPL”.
What about dumb terminal support? Is it a requirement?
Consider this:

$ echo 'if 1:
    print(1)
  
    print(2)' | python3.9 -i
1
2
$ echo 'if 1:
    print(1)
  
    print(2)' | python3.9 -i
Python 3.9.0a6 (default, May  5 2020, 18:42:59) 
[GCC 9.3.1 20200408 (Red Hat 9.3.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> ... ... 1
>>>   File "<stdin>", line 1
    print(2)
IndentationError: unexpected indent
>>> 

So when input comes from a pipe, python uses the non-interactive grammar with lookahead! But you can force a REPL with -i, which uses heuristics (1) and (2).

1 Like

Small corrections to above piped stdin examples:
The last example only prints an error when the blank line is actually empty.
Let’s make whitespace clear by quoting every line:

(
  echo 'if 1:'
  echo '    print(1)'
  echo ''
  echo '    print(2)'
) | python3.9 -i
Python 3.9.0a6 (default, May  5 2020, 18:42:59) 
[GCC 9.3.1 20200408 (Red Hat 9.3.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> ... ... 1
>>>   File "<stdin>", line 1
    print(2)
    ^
IndentationError: unexpected indent
>>> 

If the blank line contains space(s), as I mistakenly pasted in previous post, rule (2) doesn’t kick in:

(
  echo 'if 1:'
  echo '    print(1)'
  echo ' '
  echo '    print(2)'
) | python3.9 -i
Python 3.9.0a6 (default, May  5 2020, 18:42:59) 
[GCC 9.3.1 20200408 (Red Hat 9.3.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> ... ... ... ... 
1
2
>>> 

(Obviously, this distinction is too subtle for parsing files! It’s just a minimal kludge giving special meaning to consecutive [Enter][Enter]… To make this extra confusing, ipython and IDLE do allow a blank line containing whitespace to terminate a block — probably because they auto-indent after the first [Enter].)


And the first example was supposed to be without -i:

(
  echo 'if 1:'
  echo '    print(1)'
  echo ''
  echo '    print(2)'
) | python3.9
1
2

For the REPL to work, a possible way of treating a line starting with .some_method is to prefix it with last_assigned_var = last_assigned_var.
In this way,

>>> last_assigned_var = something
>>>     .method_1()
>>>     .method_2()
>>>     .method_3()

is just syntactic sugar for

>>> last_assigned_var = something
>>> last_assigned_var = last_assigned_var.method_1()
>>> last_assigned_var = last_assigned_var.method_2()
>>> last_assigned_var = last_assigned_var.method_3()

With this implementation, each line starting with . can be immediately executed with no problem.

In a similar way, if the last line is an expression rather than an assignment, we can treat

>>> something
>>>     .method_1()
>>>     .method_2()
>>>     .method_3()

as this

>>> something
>>> _.method_1()
>>> _.method_2()
>>> _.method_3()
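
A rough sketch of how a REPL front end could apply this rewriting — desugar_chain here is purely hypothetical, not an existing CPython or IPython hook:

import re

def desugar_chain(lines, target="_"):
    # Rewrite continuation lines starting with "." into standalone statements,
    # re-assigning either the last assigned name or "_" (as proposed above).
    out = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("."):
            out.append(f"{target} = {target}{stripped}")
        else:
            m = re.match(r"\s*([A-Za-z_]\w*)\s*=[^=]", line)  # simple assignments only
            target = m.group(1) if m else "_"
            out.append(line)
    return out

print(desugar_chain([
    "last_status = cases",
    "    .filter(F.col('rank') == 1)",
    "    .select('case_id')",
]))
# ['last_status = cases',
#  "last_status = last_status.filter(F.col('rank') == 1)",
#  "last_status = last_status.select('case_id')"]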

As much as I love method chaining on a pandas.DataFrame, it’s also a huge nightmare to debug; what often happens is that I have to place a breakpoint before the starting line and manually copy-paste from the start up to each attribute access to see what’s going on inside… which makes me hate it. Reading and writing it, I love it; debugging it, I hate it.

I’d imagine a lot of people are mixed like this. It’s even discussed in this tutorial.

Unless there’s some special support for that in debuggers and tools, I don’t see enough benefit, given the massive hurdles that Guido discussed.

1 Like

Those are clever approximations, but I doubt they can cover all cases.
Code that has already executed may have had unacceptable side effects.

>>> spaceship_doors.open += [door1, door2]
>>>    .filter(lambda d: d.depressurizing_is_safe_for(human_locations))

If spaceship_doors defines __setattr__ for .open, or spaceship_doors.open overloads the += operator to actually open doors, or even if it were a regular variable holding a regular list that another thread periodically checks and applies in reality — then the first line could kill the passengers before you type in the second line that was meant to prevent it.

The above is bad software design, but I don’t think it’s a strawman at the level of language design: if we can’t guarantee the two forms are always equivalent, programmers would have to learn the subtle semantic differences between files and the REPL. Is a REPL syntax that only mostly approximates file behaviour then helpful, or harmful?

  • E.g. maybe it’s better to at least make it a SyntaxError in the REPL, and have programmers consciously write out _.method_1() etc. — then they at least understand that the first line has already run before the second comes in.
    Though the error can only come on the second line, too late for the spaceship passengers — but the programmer will be more careful next time?

[To some degree, this is already a risk when pasting long code with blank lines. But that’s already a given, and doesn’t mean it’s worth complicating it further.]

IMHO, while not quite the same, JavaScript’s “automatic semicolon insertion” is relevant prior art – it shows that complex rules for “when is a statement finished” can cause lots of debate & confusion. E.g. JS style guides advising “always use semicolons and you won’t need to learn those rules” had a point, but weren’t airtight.