A lot of code in data science applications using pandas/pyspark/dask/scikit-learn etc. makes heavy use of method-chaining for pipelining. This is natural in many cases, and necessary in the case of pyspark (where the query plan is only created at the end, and its shape may substantially affect performance).
I think the syntax options currently available for pipelining in Python are suboptimal, and I've been following the inclusion of the new PEG parser (pegen) with great interest, not least because I believe it would allow (at least in principle) improving this situation.
I'm going to use a generic pyspark example to illustrate the current options, as well as two separate but related improvements I'd like to propose:
1. Allowing line-continuation by indentation, without a `\`-marker
2. Amending PEP8 to allow "dot-alignment"
Please note that all my comments about parsers and their capabilities & limitations are on a best-effort basis, and I make no claim to speak authoritatively.
Prelims
```python
from pyspark.sql import Window
import pyspark.sql.functions as F

some_actual_value = 1000
```
Note that importing `pyspark.sql.functions` with an alias is important: a star-import would otherwise shadow builtins like `max` and `min`.
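The shadowing issue can be demonstrated without pyspark at all; a minimal sketch, where the lambda is just a hypothetical stand-in for a library function that happens to share a builtin's name:

```python
# Stand-in for what a star-import of pyspark.sql.functions would do:
# it binds names like `max` into the current namespace, hiding the builtin.
max = lambda col: f"MAX({col})"  # hypothetical column-expression builder

# The bare name now resolves to the library function...
assert max("some_value") == "MAX(some_value)"

# ...and the actual builtin has to be reached explicitly:
import builtins
assert builtins.max(1, 2) == 2
```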
IIUC, before Python 3.9 the syntax was limited by the old LL(1) parser, which could not look ahead more than one token. This meant that a line continuation had to be indicated by a special token (`\`), and crucially, that no comments between continued lines are possible this way.
Basic pipelining with `\`; no comments possible
```python
last_status = cases \
    .withColumn("rank", F.row_number().over(
        Window.partitionBy("case_id").orderBy(F.desc("some_timestamp")))) \
    .filter(F.col("rank") == 1) \
    .select(
        F.col("case_id"),
        F.col("is_archived").alias("is_archived_latest"),
        F.when((F.col("is_archived") == F.lit(True))
               | (F.col("some_value") <= F.lit(some_actual_value)),
               # if archived or [whatever], set some_value to zero
               F.lit(0))
         .otherwise(F.col("some_value"))
         .alias("some_value"))
```
This limitation is IMO unacceptable in most cases (especially for high-complexity, collaborative code), because documenting pipelined code is essential.
One way to enable comments nevertheless is to give the parser "guardrails" in the form of brackets. Due to the PEP8 indentation rules, the continuation must then either be aligned with the opening bracket, or start on a new line with a hanging indent. While it's not the end of the world, it becomes cumbersome (IMHO) to wrap things in brackets as soon as you need a comment, aside from the higher density of brackets, which makes digesting the code even harder.
Using brackets; no extra newline
```python
last_status = (cases
    # per case: sort by some_timestamp
    .withColumn("rank", F.row_number().over(
        Window.partitionBy("case_id").orderBy(F.desc("some_timestamp"))))
    # restrict to last date
    .filter(F.col("rank") == 1)
    .select(
        F.col("case_id"),
        F.col("is_archived").alias("is_archived_latest"),
        F.when((F.col("is_archived") == F.lit(True))
               | (F.col("some_value") <= F.lit(some_actual_value)),
               # if archived or [whatever], set some_value to zero
               F.lit(0))
         .otherwise(F.col("some_value"))
         .alias("some_value")
    ))
```
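That it really is the brackets that make the interleaved comments legal can be checked mechanically; a minimal sketch using only the standard library (the toy sources stand in for the pipelines above):

```python
def builds_to_3(src):
    """Return True iff `src` compiles and assigns x == 3."""
    ns = {}
    try:
        exec(src, ns)
    except SyntaxError:
        return False
    return ns.get("x") == 3

# Inside brackets, a comment between continued lines is fine:
assert builds_to_3("x = (1\n     # a comment\n     + 2)\n")

# With a bare backslash continuation, the same comment breaks the statement:
assert not builds_to_3("x = 1 \\\n    # a comment\n    + 2\n")
```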
Using brackets; with extra newline
```python
last_status = (
    cases
    # per case: sort by some_timestamp
    .withColumn("rank", F.row_number().over(
        Window.partitionBy("case_id").orderBy(F.desc("some_timestamp"))))
    # restrict to last date
    .filter(F.col("rank") == 1)
    .select(
        F.col("case_id"),
        F.col("is_archived").alias("is_archived_latest"),
        F.when((F.col("is_archived") == F.lit(True))
               | (F.col("some_value") <= F.lit(some_actual_value)),
               # if archived or [whatever], set some_value to zero
               F.lit(0))
         .otherwise(F.col("some_value"))
         .alias("some_value"))
)
```
Using brackets; black'd
```python
last_status = (
    cases
    # per case: sort by some_timestamp
    .withColumn(
        "rank",
        F.row_number().over(
            Window.partitionBy("case_id").orderBy(F.desc("some_timestamp"))
        ),
    )
    # restrict to last date
    .filter(F.col("rank") == 1).select(
        F.col("case_id"),
        F.col("is_archived").alias("is_archived_latest"),
        F.when(
            (F.col("is_archived") == F.lit(True))
            | (F.col("some_value") <= F.lit(some_actual_value)),
            # if archived or [whatever], set some_value to zero
            F.lit(0),
        )
        .otherwise(F.col("some_value"))
        .alias("some_value"),
    )
)
```
Personally, I believe it would improve the daily lives of many people who work with these (extremely widespread) libraries if it were possible to write something like the following:
```python
last_status = cases
    # per case: sort by some_timestamp
    .withColumn("rank",
                F.row_number().over(Window.partitionBy("case_id")
                                          .orderBy(F.desc("some_timestamp"))))
    # restrict to last date
    .filter(F.col("rank") == 1)
    .select(
        F.col("case_id"),
        F.col("is_archived").alias("is_archived_latest"),
        F.when((F.col("is_archived") == F.lit(True))
               | (F.col("some_value") <= F.lit(some_actual_value)),
               # if archived or [whatever], set some_value to zero
               F.lit(0))
         .otherwise(F.col("some_value"))
         .alias("some_value"))
```
I believe this is clearer to read, reduces the bracket/symbol/indentation density, and closes a gap where other languages in this space currently do better than Python (e.g. R or Scala; the magrittr pipe does not need extra brackets).
AFAIU, this would need the unbounded lookahead of the new PEG parser (rather than the old parser's one-token limit), because obviously there could be arbitrarily many lines of comments between the actual code lines.
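The lookahead point can be made concrete with the `tokenize` module: between one pipeline step and a leading dot on a later line, the token stream may contain arbitrarily many COMMENT/NL tokens. (The source string below uses the proposed, currently-invalid syntax; `tokenize` does not parse, so it tokenizes it anyway.)

```python
import io
import tokenize

src = "cases\n# comment 1\n# comment 2\n.filter(x)\n"
toks = list(tokenize.generate_tokens(io.StringIO(src).readline))
names = [tokenize.tok_name[t.type] for t in toks]

# Two COMMENT tokens (plus their NL tokens) sit between the NEWLINE that
# ends `cases` and the OP token for the leading dot, so a parser deciding
# whether the statement continues must skip an unbounded number of tokens.
assert names.count("COMMENT") == 2
assert names.index("OP") > names.index("COMMENT")
```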
The keen-eyed will notice that the above example already incorporates the dot-alignment suggestion (point 2. above) as well, which would also enable the following alternative:
Alternative without newline; using dot-alignment
```python
last_status = cases.withColumn("rank",
                               # per case: sort by some_timestamp
                               F.row_number()
                                .over(Window.partitionBy("case_id")
                                            .orderBy(F.desc("some_timestamp"))))
    # restrict to last date
    .filter(F.col("rank") == 1)
    .select(
        F.col("case_id"),
        F.col("is_archived").alias("is_archived_latest"),
        F.when((F.col("is_archived") == F.lit(True))
               | (F.col("some_value") <= F.lit(some_actual_value)),
               # if archived or [whatever], set some_value to zero
               F.lit(0))
         .otherwise(F.col("some_value"))
         .alias("some_value"))
```
Personally, I'd like to see both become possible, but for me the much bigger win would be 1., whereas 2. is an (even more) cosmetic improvement.
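As an aside: until something along these lines exists, the magrittr-style flow can be approximated in plain Python with a small helper (the `pipe` function below is hypothetical, not part of any of the libraries mentioned); the brackets then come "for free" from the call, and comments can go anywhere:

```python
from functools import reduce

def pipe(value, *funcs):
    """Hypothetical helper: thread `value` through `funcs` left to right."""
    return reduce(lambda acc, f: f(acc), funcs, value)

result = pipe(
    range(10),
    # keep the even numbers
    lambda xs: [x for x in xs if x % 2 == 0],
    # square them
    lambda xs: [x * x for x in xs],
    sum,
)
assert result == 120  # 0 + 4 + 16 + 36 + 64
```

Of course this only covers free functions, not methods, which is exactly why the syntax-level change would be the bigger win.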