Introduce a funnel operator, i.e. `|>`, to allow for generator pipelines

If it were a binary operator rather than a DSL, then intermediate values could be captured like this:

result = ((sumneg := (summ :=
    [1, 2, 3]
    |> sum)
    |> neg)
    |> partial(add, 1)
)
print(summ)     #  6
print(sumneg)   # -6
print(result)   # -5
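For reference, the same capture pattern already works today with nested calls and the walrus operator; a real `|>` would only flip the reading direction. A runnable sketch:

```python
from functools import partial
from operator import add, neg

# Walrus-captured intermediates with today's syntax: innermost call
# runs first, so reading order is inside-out rather than left-to-right.
result = partial(add, 1)(sumneg := neg(summ := sum([1, 2, 3])))
print(summ)     #  6
print(sumneg)   # -6
print(result)   # -5
```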

Arguably not as neat as the earlier proposals that put `as` (and similar) inside a DSL-like statement, but I quite like it.

It makes appropriate use of the walrus operator and doesn’t clutter the pipe itself, placing all assignments at the top.

The order seems reversed, but it is consistent with ordinary assignment:

(sumnegadd := (sumneg := (summ :=
    [1, 2, 3]
    |> sum)
    |> neg)
    |> partial(add, 1))

Note that this only works if the pipe spits out actual results rather than wrapper objects to be unwrapped at the end. None of the user-level pipe implementations can make use of this: not the pipe-processor function, not pipe-function composition, not the instantaneous pipe.

An infix operator could, but would anyone be satisfied with infix syntax (for anything)?


I was very excited when I first discovered the possibility of infix operators, but I haven’t found a single use case for them. They just don’t stick: a hefty concept and not a pretty-looking syntax. Maybe they would be more attractive if text editors recognised them and painted them in a single color, but that doesn’t make much sense.
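For readers unfamiliar with the trick being referred to: a user-level "infix operator" can be emulated today by overloading an existing operator on a small wrapper class. A minimal sketch (the `to` operator here is made up for illustration):

```python
class Infix:
    """Classic infix-operator emulation: wrap a two-argument function
    so it can be written as `x |op| y` via the bitwise-or operator."""
    def __init__(self, func):
        self.func = func

    def __ror__(self, left):
        # `x | op` partially applies the left operand.
        return Infix(lambda right: self.func(left, right))

    def __or__(self, right):
        # `(x | op) | y` supplies the right operand and calls through.
        return self.func(right)

# A made-up "pipe into" operator: feed a value into a callable.
to = Infix(lambda x, f: f(x))

print([1, 2, 3] |to| sum)  # 6
```

As the post says, it works, but the concept is hefty for what it buys you.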

Just to pull this thread a little, PEP 677 proposed -> as a syntax for typing.Callable and included notes on why this suggests that => is a good digraph for lambdas.

Even though PEP 677 was eventually rejected, the => lambda syntax wasn’t part of that proposal and may be worth pursuing.

Based on that idea, a sample pipeline:

range(100) |> (xs) => (x + 1 for x in xs) |> sum

I have written it with the lambda parameter parenthesized, on the assumption that => would require parentheses.
The generator expression is parenthesized only for readability.

I’ll say that this makes me think a pipeline operator’s precedence rules are significant. It’s not clear whether the right-hand side of that lambda includes the rest of the pipeline. That is, my example parses as one of these three:

# left-associative, higher or lower precedence than `=>` doesn't matter
(range(100) |> (xs) => (x + 1 for x in xs)) |> sum
# right-associative, lower precedence than `=>`
range(100) |> (((xs) => (x + 1 for x in xs)) |> sum)
# right-associative, higher precedence than `=>`
range(100) |> (xs) => ((x + 1 for x in xs) |> sum)

It doesn’t impact the semantics in this case, but I’m sure we can concoct an example in which it does become significant. If the behavior of the pipe itself can be controlled via a dunder method, such examples become trivial to create.
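To make the associativity question concrete, the left-associative reading (parse #1 above) can be emulated today with a wrapper class that overloads `|`, which is itself left-associative; note the lambda must then be fully parenthesized. A minimal sketch:

```python
class P:
    """Tiny pipeline wrapper: `P(x) | f` applies f to x and rewraps.
    Since `|` is left-associative, each stage runs before the next."""
    def __init__(self, value):
        self.value = value

    def __or__(self, func):
        return P(func(self.value))

# Mirrors: (range(100) |> (xs) => (x + 1 for x in xs)) |> sum
result = (P(range(100)) | (lambda xs: (x + 1 for x in xs)) | sum).value
print(result)  # 5050
```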


I appreciate the scipy, pytorch, and tensorflow examples, since they are all real code where we can observe the impact.

Keeping in mind that this is just a personal aesthetic judgement, I think the scipy and tensorflow examples are both made much worse to read by the addition of the new syntax.
The pytorch one is quite bad either way, but I don’t think it’s improved by use of pipes.

In the pytorch example, a large number of operations are being composed, but neither the composition of the operations nor the intermediate results are named. Not only are there no named variables or functions, there are no comments either. Maybe for a domain expert this code is perfectly clear, but to me it looks like poor practice.


I’d like to look carefully at that scipy example again, to see how it could be rewritten such that it has improved clarity.

weight = max(
    np.max(np.abs(array[np.isfinite(array)]), initial=1.0)
    for array in arrays
)

I have no domain knowledge, so I’ll just use f as my function name for what seems like the primary operation that can be extracted. Using the => syntax from above,

f = (array) => np.abs(array[np.isfinite(array)])
weight = max(
    np.max(f(array), initial=1.0) for array in arrays
)

Is it improved by combining it with a pipeline?

f = (array) => np.abs(array[np.isfinite(array)])
weight = max(
    array |> f |> (a) => np.max(a, initial=1.0)
    for array in arrays
)

No, I would say that makes it worse. What if we bind that second lambda to the name g?

f = (array) => np.abs(array[np.isfinite(array)])
g = (a) => np.max(a, initial=1.0)
weight = max(array |> f |> g for array in arrays)

For comparison, here that is again without pipelines:

f = (array) => np.abs(array[np.isfinite(array)])
g = (a) => np.max(a, initial=1.0)
weight = max(g(f(array)) for array in arrays)

Compare any of these against the original. Are we actually improving readability using pipes? I’m doubtful.
To me, the best version of this is the second one, in which f is assigned but no pipelines are used.
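For what it's worth, the shape of that second version is runnable today without any new syntax; here is a NumPy-free analog (plain lists, `math.isfinite`, and an explicit 1.0 floor standing in for `initial=1.0`; the sample data is made up for illustration):

```python
import math

def max_finite_abs(values):
    # Keep only finite entries, take absolute values, and floor the
    # result at 1.0, mirroring np.max(..., initial=1.0).
    finite_abs = [abs(v) for v in values if math.isfinite(v)]
    return max(finite_abs + [1.0])

arrays = [[-3.0, float("nan"), 2.0], [float("inf")], [0.5]]
weight = max(max_finite_abs(a) for a in arrays)
print(weight)  # 3.0
```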

2 Likes

To me the following* would be easier to grok than the original on first sight:

result = (
  arrays
  |> partial(map, func=lambda array: array[np.isfinite(array)]
                       |> np.abs
                       |> partial(np.max, initial=1.0)
                  )
  |> max
)

*although for some reason map currently does not take keyword arguments, but that seems like a minor issue.

I have, and the most promising one I found was expression, which tries to be type-safe. I’m not a fan of the pipe function (I find |> cleaner), but if it were to make it into the stdlib I’d take it over the status quo.

Both mentioned libraries, toolz and expression, implement the same pipelining function.

Considering the wide range of functional programming tools (and others) offered by both libraries, I don’t think it’s feasible to support each one with special syntax.
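For context, the pipelining function both libraries provide is, in essence, a left fold over the supplied callables; a minimal sketch of that idea:

```python
from functools import reduce

def pipe(value, *funcs):
    """pipe(x, f, g) == g(f(x)): thread a value through funcs
    left to right, in the spirit of toolz.pipe."""
    return reduce(lambda acc, f: f(acc), funcs, value)

print(pipe([1, 2, 3], sum, lambda s: -s))  # -6
```

With no functions supplied it simply returns the value unchanged, which is the fold's base case.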


You don’t need one; it works without keyword arguments as well. Also, note that you’ve just created a long lambda expression.

Here’s the version using the hypothetical forward pipe operator with fixed indentation:

result = (
    arrays
    |> partial(
        map, lambda array: array[np.isfinite(array)]
                 |> np.abs
                 |> partial(np.max, initial=1.0)
    )
    |> max
)

Note that long and complex lambdas are neither readable nor maintainable.

Here is what I would recommend:

def max_finite_abs(array):
    finite_values = array[np.isfinite(array)]
    return np.max(np.abs(finite_values), initial=1.0)

weight = max(map(max_finite_abs, arrays))

There is no guarantee that users will follow proper indentation rules, avoid excessive nesting of function calls, correctly indent forward pipe operators, or refrain from writing everything on a single line. These issues fall under coding style. In fact, many of the arguments commonly made in favor of the forward pipe operator could soon be turned against it if it becomes part of the language syntax.

1 Like

While I wasn’t able to review the whole discussion, I’m surprised that pyspark hasn’t been mentioned yet, and there have been only one or two mentions of pandas DataFrames. Working with DataFrames is a prime use case for piping various operations (cf. also R’s magrittr, which has been mentioned in passing).

More to the point, Spark now even has a custom extension to venerable old SQL that allows writing queries using |>. Presumably they (and their users) would be very happy to have the same for their pyspark bindings. In any case, this should provide several more avenues for examples and prior art (not everyone may be familiar with it, but pyspark/Databricks is a huge player on the enterprise side of “big data”).

5 Likes

I believe this illustrates the rationale behind introducing pipe syntax:

from pyspark.sql import functions as F

# Load the CSV file into a DataFrame.
df = spark.read.option("header", "true").csv("lineitem.csv")
# Filter the DataFrame based on a column.
filtered = df.filter(df['price'] < 100)
# Group the remaining rows, compute the total, and show the result.
(filtered
    .groupBy("country")
    .agg(F.sum(filtered['price'] * filtered['quantity']).alias("total"))
    .show())

This approach supports flexible iteration on ideas. We know that the source data exists in some file, so we can start right away by creating a DataFrame representing that data as a relation. After thinking for a bit, we realize that we want to filter the rows on the price column. OK, so we can add a .filter step to the end of the previous DataFrame. Oh, and we want to compute an aggregation at the end, so we add that to the end of the sequence.

Many of these users wish SQL would behave more similarly to modern data languages like this. Historically, this was not possible, and users had to choose one way of thinking or the other.

Is the new SQL pipe operator influenced by the functional chaining style popularized by Python libraries like pandas and PySpark?

So I’ve been letting the comments in this thread stew in my brain a little, and I think I’ve come to the conclusion that this discussion in its current form will never reach a conclusion or be implemented.

Here is why.

Reason One: The most useful form of pipes can already be done in Python

The most useful features of being able to declare a pipe (with or without an operator) are being able to see at a glance the order of operations, how they transform the data, how you can modify them, and potentially how that work is scheduled across cores.

As @pf_moore has mentioned, you can write a simple pipe function that handles this:

result = pipe(data, step_1, step_2)

I think the reason most people opt not to do this is the time and effort to build, and more importantly maintain, such a construct in code that has to stay compatible with libraries and functions built without such a feature existing in Python.

The only thing that I can see “|>” unlocking is the ability to mix expressions with a list of steps in the pipeline:

result = x |> [item for item in _] |> step

But I don’t think this is actually a net benefit.

  1. It allows these statements to become hard to read again, which defeats the purpose of this proposal.
  2. One of the benefits of pipelines, as I see it, is the ability to declare how the work gets scheduled (similar to Rx). In some data pipelines you need the ability to balance performance, for example how many cores get dedicated to a certain task.

You can either push this logic to where you are doing your pipeline work, which makes that code hard to grok, or you can push it into the functions themselves, which can make it harder to manage resources because you might miss that one function spins up a bunch of threads with no way to configure it.

Allowing immediately executed expressions like this would prevent patterns where you declare separately what work should be done and when it should be done (similar to languages like Halide, where the pipeline and the scheduling are split):

results = pipe(data, step_1, step_2, scheduler=concurrent.futures.ThreadPoolExecutor(max_workers=4))
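One possible reading of that split, sketched with today's tools; the `scheduler` keyword and the per-element semantics below are my own assumptions for illustration, not part of any proposal:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def pipe(items, *steps, scheduler=None):
    """Apply `steps` left-to-right to each item, optionally fanning
    the work out across a caller-supplied executor. The pipeline
    declaration stays the same either way; only scheduling changes."""
    composed = lambda x: reduce(lambda acc, f: f(acc), steps, x)
    if scheduler is None:
        return [composed(x) for x in items]
    with scheduler as pool:
        return list(pool.map(composed, items))

data = [1, 2, 3]
step_1 = lambda x: x + 1
step_2 = lambda x: x * 2
print(pipe(data, step_1, step_2))  # [4, 6, 8]
print(pipe(data, step_1, step_2,
           scheduler=ThreadPoolExecutor(max_workers=4)))  # [4, 6, 8]
```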

Therefore we should just focus on getting a pipe-like construct into functools; there is already a discussion and a proposed implementation in progress for this.

Reason Two: The majority of what makes pipes cumbersome to use isn’t the lack of syntax to define them, but the lack of syntax to define placeholders or functions that can be used in a pipe

Even with Reason One basically solved, pipes are still not as ergonomic as I would like. One problem is that there is no agreement on where the parameter that accepts the data should live. This makes sense, since Python wasn’t built for functional programming, so no convention has emerged beyond a few built-in functions like map and filter taking the data as the last argument.

This means you need a combination of partials and placeholders, which can make pipes just cumbersome enough to write that I wonder whether the people who would benefit from them would actually use them; it also hurts readability.

result = pipe(data, functools.partial(step_1, functools.placeholder, 2), step_2)
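A placeholder-aware partial along those lines can be sketched in a few lines; note that `functools.placeholder` above is the post author's hypothetical name, and that Python 3.14 adds a real `functools.Placeholder` with similar semantics. The `step_1` function below is made up for illustration:

```python
# Sentinel standing in for the hypothetical functools.placeholder.
_ = object()

def partial_(func, *bound):
    """Like functools.partial, but occurrences of the `_` sentinel in
    `bound` are filled from the call's positional arguments, in order."""
    def wrapper(*args):
        it = iter(args)
        filled = [next(it) if a is _ else a for a in bound]
        return func(*filled, *it)  # leftover args go at the end
    return wrapper

def step_1(data, factor):
    return [x * factor for x in data]

f = partial_(step_1, _, 2)   # data slot left open, factor bound to 2
print(f([1, 2, 3]))  # [2, 4, 6]
```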

You can of course wrap the function call in a lambda for the same effect, but lambdas strip away any ability for the pipe function to figure out which function the user is trying to call (unlike partials). That means that if we one day wanted to let a library author detect when two functions are being called as adjacent, concurrent steps of a pipeline, it would simply not be possible (among other use cases) without providing a pipe-aware interface such as “function” and “function_pipeable” (or a decorator that lets them provide a version the pipe interface can switch to):

result = pipe(data, lambda i: cv2.boostContrast(i), cv2.saturation)  # cv2 can't detect that the two commands are adjacent and configure the functions to run in place

Therefore I think the only way to solve this particular speed bump is to start a new discussion to figure out how we could easily declare such partials. I know this discussion has happened in Python many times, and it feels like a large part of the discussion above essentially descended into a full-blown debate over what was a new way of declaring partials in all but name.

Next Steps:

Therefore I want to suggest some next steps

  • For Reason One: let’s try to land a pipe implementation in functools, by contributing to the proposal already in the works: `functools.pipe` - Function Composition Utility - #50 by mikeshardmind
  • For Reason Two: continue a new discussion around a simple way to declare partials. There are many good ideas by @sadaszewski, and I quite like the idea of simply placing an @ prefix on a function call to make it a partial: @my_func(1, 2, _). Having literals for declaring partials like this is what unlocked the ability to construct pipelines in languages like Swift, which also struggles with inconsistent interfaces in a way that more functional languages like JavaScript don’t (mainly because in JS you can just call “.apply” on any function).
5 Likes

I am fine with wrapping this up. PEP 638 seems the better way to me, and a fun exercise to implement. Python is established and opinionated enough that letting people build their own syntax is a more viable option than extending the standard indefinitely. I wonder if the dynamics of the community would allow people to go their own way in this fashion.

4 Likes