Introduce funnel operator, i.e. '|>', to allow for generator pipelines

The problem here is that you’ve essentially just demonstrated that all of this is perfectly possible right now with existing language features, but it’s not good enough because lambda and partials are verbose and ugly, and people don’t like using them.

However, the arguments that “we need a better lambda” and “we need better partial application” have been made over and over in the past, and never succeeded. I suggest you hunt out all those earlier discussions and work out why they failed, because you’re about to repeat the whole thing, and if you don’t have ways to address the previous showstoppers, you’ll just be wasting your time and everyone else’s.

Personally, I agree - I’d like to see a more concise lambda syntax, and easier partial application. But there has never been any sort of community consensus agreeing with that, and I’m not sure that another round of the debate will change anything.

It’s not “once things fall into place”, but rather “when you address the core dev requests noted in the issue”. Have you done any further work on addressing those requests since the issue was closed? That would be a much more productive use of your time than proposing yet more variants on the same idea…

4 Likes

It is for me.

This is orthogonal. That one is function composition, while this is an immediate-application feed pipe.
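To illustrate the distinction (the |> line is of course hypothetical syntax):

composed = lambda x: g(f(x))  # composition: builds a new function, nothing runs yet
y = x |> f |> g               # feed pipe: applies f and then g to x immediately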

Yeah I agree - I’ll have a read through. I actually think lambda syntax is fine.

And even partials wouldn’t really be a deal breaker if the functional-style functions (in functools and elsewhere) had a consistent interface, so that the iterable was always in the same position.
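For example, the “data” argument moves around between the common functional helpers today:

from functools import reduce
from operator import add

nums = [3, -1, 2]
map(str, nums)         # iterable is the last argument
filter(None, nums)     # iterable is the last argument
sorted(nums, key=abs)  # iterable is the first argument
reduce(add, nums)      # iterable is the second argument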

In fact, technically, the way JavaScript handles partials, to let you bind arguments to functions passed into a pipe, is to give functions a method that constructs a partial.

That says to me that the syntax in Python is probably fine, and if we really wanted it we could make it a method of the function:

def foo():
    pass

foo.bind(1)  # alias for partial(foo, 1) - self being the function itself
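As a rough sketch, something like that can even be emulated today with a decorator (bindable and add are invented names):

from functools import partial

def bindable(func):
    # attach a JS-style .bind that just builds a partial
    func.bind = lambda *args, **kwargs: partial(func, *args, **kwargs)
    return func

@bindable
def add(a, b):
    return a + b

add_one = add.bind(1)  # equivalent to partial(add, 1)
add_one(2)             # 3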

It feels like much of what’s been proposed so far may just be a case of trying to work around the lack of a convention that other languages have been able to establish among their functional methods.

We are where we are; I’m not really sure how you would fix that. Introduce a new futures package with the order of arguments changed?

For now I’ll see if I can land a very basic pipe function in functools per @dg-pb’s proposal.
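Something along these lines, as a minimal sketch (not necessarily the exact API of that proposal):

def pipe(value, *funcs):
    # feed value through each function in turn
    for func in funcs:
        value = func(value)
    return value

pipe("  hello ", str.strip, str.upper)  # 'HELLO'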

On the topic of examples of the syntax without partials:

To me, this proposal (without the special implicit partial/lambda syntax) is currently best explained as a modest improvement over

result = transform(result)
result = update(result)
result = tidy(result)
result = verify(result)

where, as can easily be seen, result is some sort of data to which many atomic (as in simple) transformations are applied.
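Under the proposed syntax, that chain collapses into a single expression:

result = result |> transform |> update |> tidy |> verify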

This pattern does not come up as much in plain Python, because Python is, and must be, open to doing a lot of stuff in a simple way.
However, it is rather common in code centered on using libraries with many workflows and many users with different goals.

This is because functions that could be really ‘simple’ sometimes involve processes that the programmer can’t control (encryption certs, specific formats, …) but that the library needs.
As an example, sending some data to some server:

import json

from my_server import Connection
from my_server.dtypes import data_t
from my_server.send import (
    verify,
    compress,
    encrypt,
    add_checksum,
)
from my_server.recv import *  # explicit import list would be too long

def send_data(uri: str, data: str | data_t) -> dict:
    with Connection(uri) as conn:
        send_response = (data
            |> verify
            |> compress
            |> encrypt
            |> add_checksum
            |> conn.send
        )

        assert 200 <= send_response.status_code < 300, f"There's been an issue sending {data!r}"

        return (conn.recv(1024)
            |> verify_checksum
            |> decrypt
            |> decompress
            |> verify
            |> json.loads
        )

In the previous example, most calls are both static and come with a valid default configuration, due to the limitation of not using partials.
As context, I presented this as part of some library because in many cases it would be possible to write better code to do all that; but maybe this library uses a specific algorithm or a specific hardcoded encryption key/cert that I am better off not worrying about, which I believe shows the use of this syntax.
However, although IMO this shows the syntax is readable and usable per se, one can see that these limitations can be overcome with some extensions that others have proposed, which would lift the restriction to “mostly static and with valid default arguments”.

3 Likes

I think these generator/loops examples are the wrong place to look for compelling uses of |> (despite the title of this thread). Anything involving map() or filter() can already be done with a comprehension, usually more concisely, since the arbitrary maps and filters can be done together, and the cost of an extra attribute access or method call is just .attribute or .method rather than a whole extra row of |> map(attrgetter("attribute")).
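For instance (names invented for illustration):

# comprehension: filter, map and attribute access in one pass
names = [p.name for p in people if p.active]

# versus a hypothetical pipeline spelling of the same thing
names = people |> filter(lambda p: p.active) |> map(attrgetter("name")) |> list()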

4 Likes

I agree with the first part of the argument; I don’t find it any easier to read either. That is why I wanted to see the new syntax in real code. You will see that I didn’t make any comment on which version is better. It seems to me that the exact proposal, a “funnel operator for generator pipelines”, isn’t as ergonomic as claimed.

On the other hand, a general pipeline operator, that doesn’t need generators, per se, seems quite elegant to me.

I have some wrapper code around OpenCV which overrides __or__, and a custom Image object which wraps an np.array; it overrides __or__ as well. This allows me to write code like:

cleaned_image = (
    image  # Custom Image object
    | HistEq()
    | Threshold(thresh=128, maxval=255, otsu=False)
    | Erode((5, 5))
    | Dilate((3, 3))
    | Morphology.open(Kernel.ellipse((3, 3)))
)

This doesn’t work with method chaining, because I would have to implement all possible operations as methods and cram them into the Image object.

Furthermore, intermediate variables don’t make sense here, and these operations are empirically obtained. This also lets me comment out or add operations as I see fit, without much work.
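For anyone curious, the mechanism is plain operator overloading; a minimal sketch (Image and Threshold here are simplified stand-ins, not my actual classes):

import numpy as np

class Image:
    def __init__(self, data: np.ndarray):
        self.data = data

    def __or__(self, op):
        # feed this image into the operation and wrap the result
        return Image(op(self.data))

class Threshold:
    def __init__(self, thresh=128, maxval=255):
        self.thresh = thresh
        self.maxval = maxval

    def __call__(self, data):
        return np.where(data > self.thresh, self.maxval, 0).astype(np.uint8)

result = Image(np.zeros((4, 4), np.uint8)) | Threshold(thresh=10)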

It seems to me like you are saying this is objectively bad. Well, I don’t agree. I think this is precisely the point of sum: to be more declarative. It seemed quite natural to me. I personally don’t prefer the total = 0; total += 1 pattern, for the same reason I tend to use enumerate instead of i = 0; i += 1. On top of that, I guess my version would be a bit faster in practice, without losing any readability. The original dataset for that problem is somewhere around 1000-10000 lines long. Anyway, this is off topic. Let’s agree to disagree.

EDIT:

I am sure I didn’t get this. Someone else’s code is almost guaranteed to be filled with their own custom functions, unless they don’t define any functions at all? Or are you saying their own implementations of stdlib functions? I don’t think anyone is proposing something like that.

2 Likes

Yes, this is what I’ve been struggling to express when I say that pipelines are a good fit in other languages, but feel like they don’t fit well in Python.

People can follow such a convention locally (in their own code, or maybe even across a particular discipline) but generally, that’s not enough for a language feature (or to a lesser extent a stdlib function).

1 Like

Now that I think of it, OpenCV is actually a good example of a library that benefits. Most of its APIs take a numpy array and return a new one.

Technically the | operator already has a semantic meaning, as in “apply bitwise OR to both images”.

So overriding it in the numpy class breaks your ability to do that.

You can assign intermediate results to variables, but in OpenCV that causes memory copies, meaning you get the same result at the cost of more memory, especially if you don’t reuse the same intermediate variable.

OpenCV settled on adding a final dst argument to most of its APIs that allows you to do the math in place.

NumPy itself has also had to do the same thing, via its out parameter.

This can be kind of clunky to do, so as a result most people reach for the option that is easier to write in Python but terrible for performance: operators or temporary variables.

Theoretically, let’s say we added this |> operator.

It could allow numpy and OpenCV to actually know that you want to do an operation in place, enabling optimizations that just aren’t possible with the current language design without special arguments on functions, dedicated more-performant functions that do the same thing as an operator, or wrapping the code in your own class that does these things.

For example, take this:


import cv2
import numpy as np

# Load image
img = cv2.imread('image.jpg')

# OpenCV copies: every operation returns a new array
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
edges = cv2.Canny(blurred, 50, 150)
dilated = cv2.dilate(edges, np.ones((3, 3), np.uint8), iterations=1)

# NumPy copy: the addition allocates a brand-new array
brightened = dilated + 60

cv2.imshow("With Copies", brightened)
cv2.waitKey(0)
cv2.destroyAllWindows()

Easy to read, but the user has been making a copy of the image at every step.

The only way you can currently indicate you want it done in place is like so:


import cv2
import numpy as np

# Load image (copy so we can reuse original)
img = cv2.imread('image.jpg')
in_place = img.copy()

# OpenCV in-place: some functions support a 'dst' argument
# (cvtColor has to allocate here, since BGR -> GRAY changes the channel count)
gray = cv2.cvtColor(in_place, cv2.COLOR_BGR2GRAY)

# Overwriting original variable, still reuses memory after gray
cv2.GaussianBlur(gray, (5, 5), 0, dst=gray)  # In-place blur
cv2.Canny(gray, 50, 150, edges=gray)        # In-place Canny (Canny names its output kwarg 'edges')
cv2.dilate(gray, np.ones((3, 3), np.uint8), iterations=1, dst=gray)  # In-place dilate

# NumPy in-place: modify gray directly
np.add(gray, 50, out=gray, casting="unsafe")  # In-place pixel boost

cv2.imshow("In-Place", gray)
cv2.waitKey(0)
cv2.destroyAllWindows()

So @pf_moore, this would be one thing you can’t really do in a fully explicit way in the current language design that this feature could unlock.

You can find ways to chain with the current language, even if it’s clunky, and therefore people may not use them until the syntax is more useful (or maybe they will).

But you cannot hint to a library that you want an in-place operation. Instead, each library has to settle on some kind of convention for doing this.

With a funnel operator that could unlock that


img = cv2.imread("image.jpg")

result = img \
    |> cv2.cvtColor(_, cv2.COLOR_BGR2GRAY) \
    |> cv2.GaussianBlur((5, 5), 0) \
    |> cv2.Canny(50, 150) \
    |> cv2.dilate(np.ones((3, 3), np.uint8), iterations=1) \
    |> np.add(50, casting='unsafe')

While the code is more or less the same, and we can argue about whether it’s more or less readable,

what is harder to argue with is that this would allow the library to detect that the user is chaining operations and therefore apply in-place optimizations where possible, something that right now requires the user to learn an entirely separate way to use these libraries.

Potentially this is what would justify the addition? Basically, the ability to specify an “explicit in-place pipeline”.

But apart from that, I would broadly agree that a pipe function is probably a better place to focus than syntax.

1 Like

What about a Pipeline class that implements __or__ and __lt__, like plumbum?

For example,

class Pipeline:
    def __init__(self, *args):
        ops = []
        for arg in args:
            if isinstance(arg, Pipeline):
                ops.extend(arg.ops)
            else:
                ops.append(arg)
        self.ops = tuple(ops)

    def __or__(self, other):
        return Pipeline(self, other)

    def __lt__(self, other):
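        # `img > pipeline` ends up here via the reflected comparison
        # (ndarray.__gt__ returns NotImplemented for a Pipeline operand)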
        value = other
        for op in self.ops:
            value = op(value)
        return value

img = cv2.imread("image.jpg")

result = img > (
    Pipeline()
    | (lambda _: cv2.cvtColor(_, cv2.COLOR_BGR2GRAY))
    | (lambda _: cv2.GaussianBlur(_, (5, 5), 0, dst=_))
    | (lambda _: cv2.Canny(_, 50, 150, dst=_))
    | (lambda _: cv2.dilate(_, np.ones((3, 3), np.uint8), iterations=1, dst=_))
    | (lambda _: np.add(_, 50, casting="unsafe", out=_))
)

or …

def wrap_function(func):
    def wrapper(*args, **kwargs):
        assert "dst" not in kwargs
        def inner(img):
            func(img, *args, **kwargs, dst=img)
            return img
        return inner
    return wrapper

for name in ["cvtColor", "GaussianBlur", "Canny", "dilate"]:
    globals()[name] = wrap_function(getattr(cv2, name))

img = cv2.imread("image.jpg")

result = img > (
    Pipeline()
    | cvtColor(cv2.COLOR_BGR2GRAY)
    | GaussianBlur((5, 5), 0)
    | Canny(50, 150)
    | dilate(np.ones((3, 3), np.uint8), iterations=1)
    | (lambda _: np.add(_, 50, casting="unsafe", out=_))
)
1 Like

To get the performance benefit here, would you need an __rcall__ method, so that cv2 could arrange for |> to use in-place modification?

Otherwise it seems to me that the only benefit would be that the intermediate results can be garbage collected earlier?

1 Like

Yes, I believe so. It would also help with GC, by never allocating memory that was never needed in the first place.
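Purely as a thought experiment, such a hook might look something like this (nothing like __rcall__ exists today, and the names are invented):

class PipelineAwareArray:
    def __rcall__(self, func, *args, **kwargs):
        # hypothetical: called when this object is piped into func,
        # letting the library pick an in-place variant where one exists
        return func(self, *args, dst=self, **kwargs)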

Take this toy sample; the same thing happens a lot in my image-processing code, but with much more complicated algorithms with multiple steps (I can supply samples if requested).

image = cv2.imread('largeimage.jpg')
image = cv2.GaussianBlur(image, (5, 5), 0)

This is basically functionally the same as

image = cv2.imread('largeimage.jpg')
cv2.GaussianBlur(image, (5, 5), 0, dst=image)

You would think that this would not make any meaningful difference, and yet the temporary copy the first version allocates, before rebinding the image variable, is enough to make the code crash in production from lack of RAM, because the GC just cannot react fast enough to the original value going out of scope.
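You can watch the extra allocation happen with tracemalloc (sizes are illustrative; recent numpy versions report their buffers to tracemalloc):

import numpy as np
import tracemalloc

image = np.zeros((4000, 4000), dtype=np.uint8)  # ~16 MB

tracemalloc.start()
image = image + 60  # allocates a second full-size array before rebinding
print(tracemalloc.get_traced_memory()[1])  # peak: roughly one extra image
tracemalloc.stop()

tracemalloc.start()
np.add(image, 60, out=image)  # in-place: no second full-size buffer
print(tracemalloc.get_traced_memory()[1])  # peak: near zero
tracemalloc.stop()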

That’s before you even get to the cases where programmers accidentally create temporary variables to store intermediates that stick around.

It’s not uncommon to see this, since it’s easier to read:

image = cv2.imread('largeimage.jpg')
image_blur = cv2.GaussianBlur(image, (5, 5), 0)

As for the Pipeline class suggested above:

  • Well, it kind of breaks the semantics of “|”, as numpy uses that to do an OR,
  • You now need to modify the functions globally, which I’m not sure is a good idea,
  • Your code needs to handle this for all functions, even ones where it might not be a good idea, or work out from the parameters whether it is (some OpenCV functions don’t work in-place depending on the settings you pass in),
  • You now need to opt out of this by passing dst=None everywhere,
  • NumPy still has its own separate way of indicating this, via “out” instead of “dst”.

It feels to me like the ability to hint to these libraries, “you are part of a pipeline, therefore please do an in-place operation when you can”, via |>, is probably the one and only thing that could justify this feature.

Since __or__ is implemented on a separate class, you can still do something like this without conflicts:

result = (
    (img > Pipeline(lambda x: x * 2) | (lambda x: x + 1))  # `|` for pipeline
    | np.ones((3, 3), np.uint8)  # `|` for bitwise or
)

or you can do some functional stuff:

result = img > (
    Pipeline()
    | (lambda x: x * 2)
    | (lambda x: x + 1)
    | (lambda x: np.bitwise_or(x, np.ones((3, 3), np.uint8)))
)

All of these arguments also apply to your proposal, because neither numpy nor opencv currently supports currying. You would still need to write your own wrappers for them, or wait for numpy and opencv to support it.

They tend to create functions just to fit the pipeline syntax and follow the functional programming paradigm. It’s similar to the sum example, where the author creates two generator expressions instead of simply using a for loop.


@jamsamcam A few things to keep in mind:

  1. Will the pipeline stop when the iterator becomes empty, or will it keep calling functions with an empty iterator?
  2. Are there any benchmarks showing that using multiple generator objects is actually faster than a simple for loop?
  3. Are you open to providing a code patch for formatters to support the syntax style used in many of the previous examples?
  4. Is it really that inconvenient to type the two-key Shift combinations needed for |>?

@pf_moore I’ve updated the implementation. As before, you can test it online here.

The new transformation behaves as follows:

From:

[1, 2, 3] |> [ x ** 2 for x in _ ] |> map(str) |> ", ".join() |> print()

To:

(lambda _: print(_))(
  (lambda _: ", ".join(_))(
    (lambda _: map(str, _))(
      (lambda _: [ x ** 2 for x in _ ])(
        [1, 2, 3]
      )
    )
  )
)

The auto-injection behavior is controlled by the presence of _ on the RHS: if _ is present anywhere on the RHS, the auto-injection will not happen. This protects against unnecessarily complicated scenarios. Regarding “auto-lambda” / “auto-partial” behavior (i.e. transforming _ + 2 or ~ + 2 into lambda x: x + 2): I don’t think the pipeline implementation should be responsible for that. What do you think?
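In other words (hypothetical syntax):

5 |> pow(2)     # no _ on the RHS: auto-injected last, becomes pow(2, 5) == 32
5 |> pow(_, 2)  # _ present: no injection, becomes pow(5, 2) == 25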

PS. The implementation can be a bit brittle due to heavy use of macros. I will fix that and also likely make the involved bits into a more universal “core” AST walker which I was shocked not to have found in the existing codebase.

PS2. As opposed to the previous implementation, this time it combines with method calls as suggested above. Try:

upper = lambda x: x.upper()
"abc" |> upper().lower().capitalize()

or

"abc" |> _.upper().lower().capitalize()

Both work as “expected”.

PS3. Obviously this time it can handle any expressions on the RHS not just calls. This is a big power-up.

One more dummy example:

[1,2,3] |> [ x ** 2 for x in _ ] |> map(str) |> ", ".join() |> f"here are the results: {_}" |> _.capitalize()

I actually already did this in one of my projects! GitHub/sharktide/restructuredpython was a Python fork I was making and I added this as a feature!

Anyone can check out the docs here; note this is item 2.

PS. I’ve started working on the PEP.

Hello,

I’ve previously used method chaining in C# and JavaScript.
It’s concise and intuitive, but unfortunately, it’s harder to inspect the return value of each function.
I also worry that funnel (or pipeline) operators have the same issue.

So eventually, I switched to this approach (as shown in the image below).

Although it’s not visually appealing, it saves me the effort of naming temporary variables
and allows me to clearly see how the data changes after each function.

3 Likes

IMHO the current limitations of the tooling (is there a problem with displaying the return value of a call in pdb?) should not limit the design of the language to the point of forcing the above syntax, which in my opinion is very bad.

We should also not confuse debug code, which abuses the namespace and creates a ton of locals, with production code. Is this really production code? It definitely is unintuitive and error-prone, and it requires maintaining all those numberings to match. Innovation around the pipeline concept could allow this instead:

class Debug:
  def __init__(self):
    self.record = {}
  def __call__(self, tag, value):
    self.record[tag] = value
    print(f'{tag}: {value}')
    return value
  def __repr__(self):
    return repr(self.record)

if __debug__:
  debug = Debug()
else:
  debug = lambda tag, value: value

(
  number_list |> debug("number_list") |>
  abs_func() |> debug("abs_func") |>
  [ x for x in _ if x > 5 ] |> debug("filter_func") |>
  await count_func() |> debug("count_func") |>
  show_result() |> debug("show_result")
)

> number_list: [1, -2, 3, -4, 5, -6, 7]
> abs_func: [1, 2, 3, 4, 5, 6, 7]
> filter_func: [6, 7]
> count_func: 2
> show_result: 'Good'

print(debug)

> { 'number_list': [1, -2, 3, -4, 5, -6, 7], 'abs_func': [1, 2, 3, 4, 5, 6, 7], 'filter_func': [6, 7], 'count_func': 2, 'show_result': 'Good' }

PS. I made a first pass at the Rationale and Specification in the PEP. Feedback welcome.

1 Like

A pipeline / method chaining approach that allowed introspecting intermediate results would be a killer feature.

2 Likes

IMHO this is a tooling issue, not a syntax issue. Line-by-line execution should cover this, as the RHS stays mapped to the correct source code lines post-transformation in the current reference implementation, and each RHS is transformed to a lambda call. pdb would be more powerful if it could show the values of evaluated expressions/calls as it goes line by line. Alternatively, see the solution above.