Introduce funnel operator, i.e. |>, to allow for generator pipelines

This is my first time trying to write out a proposal, so bear with me. I think generators are an amazing tool for generating sequences that would typically end up needing a heck of a lot of for loops and other such code, to benefit from the reusability and lazy evaluation they offer.

But in my experience I find there are two cases where they fall short.

  1. It’s easy to end up with a rat’s nest, e.g. tqdm(enumerate(windowed(brackets, 2)))
  2. You can un-nest it, but I’ve had many bugs caused by the order of lines getting mixed up, or by people forgetting they need to convert to a list because what they’re using is a generator stored in a variable whose name doesn’t make that obvious

I would love to keep the compactness and co-location of option 1 with the readability of option 2.

Therefore my proposal is to introduce a funnel operator. It would accept an iterable on the left-hand side and a generator factory on the right, and return an iterator of its own. Each invocation would be evaluated from left to right.

So take this for example:

[message1, message2, message3] |> filter(lambda message: not message.deleted) |> map(lambda message: translate(message, 'french'))

The list would be passed to the filter generator factory, which would return a generator, which would then be passed to the map generator factory. Generator factories are just functions which accept an iterable and return a generator that consumes it.
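
To make that concrete, here is a minimal sketch of a generator factory under the proposal's semantics; the name windowed and its implementation here are illustrative only:

def windowed(n):
    def factory(iterable):
        # Yield overlapping windows of n consecutive items.
        window = []
        for item in iterable:
            window.append(item)
            if len(window) == n:
                yield tuple(window)
                window.pop(0)
    return factory

Without the operator, brackets |> windowed(2) would be spelled windowed(2)(brackets).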

So we end up with this kind of nice succinct code from my example earlier:

for index, bracket in brackets |> windowed(2) |> enumerate() |> tqdm():
    # Code here
    pass

So why generator factories at all? At first I thought we could use some kind of system of currying the iterable to the existing generators, but unfortunately the existing ones in itertools and Python often have inconsistent interfaces, where the iterable sometimes needs to be the first, second or Nth argument.

Therefore we need a factory function which handles the details of how to construct these generators. This may also allow for backwards compatibility, for example if map were to become a factory, since the function could perhaps handle cases where it’s called expecting to be a generator vs. a factory.

But yeah, I’m not really sure how you would pull this off regarding backwards compatibility.
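
One hedged sketch of how that dual-mode dispatch could work at runtime (the name pipe_map is made up, and this is not a proposal for the real built-in):

def pipe_map(func, *iterables):
    if iterables:
        # Called the classic way: behaves just like map(func, iterable).
        return map(func, *iterables)
    # Called as a factory: return a function awaiting the iterable
    # piped in from the left-hand side.
    return lambda iterable: map(func, iterable)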

1 Like

I like pipelines like this in general in most languages that have them; I’m not sure we should restrict this to generators, though. It would be great if this just forwarded values and had behavior defined in a way that works for anything being forwarded on.

It could probably be done by providing a namespaced set of factory functions

itertools.pipeable (any other name or location possible)

where the public functions within that namespace are functions that return functions taking only an iterable.
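
For illustration, such a namespace's functions might look like this minimal sketch (the module name and these wrappers are hypothetical, not an existing API):

import itertools

def takewhile(predicate):
    # Return a function taking only an iterable, suitable for a pipeline.
    return lambda iterable: itertools.takewhile(predicate, iterable)

def islice(*args):
    return lambda iterable: itertools.islice(iterable, *args)

With the proposed operator, data |> islice(5) |> takewhile(str.isalpha) would then desugar to takewhile(str.isalpha)(islice(5)(data)).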

I can’t think of a better way that has semantics that are nice to work with. It might be possible with a special dunder, but this would require detecting function calls that are part of a pipeline and handling them differently, which I’m not enthusiastic about.

3 Likes

itertools.pairwise does the job of functools.partial(windowed, n=2) if step and fillvalue are not required.

To the point, as Michael says: why limit the possible applications of the necessary work to generators? Why exclude other callables? I thought about homebrewing a currying syntax too, using __rrshift__ and >>, accepting and returning kwargs. It’s a common pattern to manually create a compose function and use that with functools.reduce.
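
For reference, that hand-rolled pattern usually looks something like this minimal sketch:

from functools import reduce

def compose(*funcs):
    # Left-to-right composition: compose(f, g)(x) == g(f(x)).
    return lambda x: reduce(lambda acc, fn: fn(acc), funcs, x)

# compose(pairwise, enumerate, tqdm)(brackets) would then mimic the
# proposed brackets |> pairwise |> enumerate |> tqdm.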

At first I thought we could use some kind of system of currying the iterable to the existing generators, but unfortunately the existing ones in itertools and Python often have inconsistent interfaces, where the iterable sometimes needs to be the first, second or Nth argument.

This is fixable with partial calls and introspection. Maybe explicit is better than implicit, however.

If you’re going to run with the original idea regardless, I’d prefer:
(brackets |> pairwise |> enumerate |> tqdm) to:
brackets |> windowed(2) |> enumerate() |> tqdm(), i.e. let |> make the calls.

1 Like

Using the pipe operator:
brackets |> windowed(2) |> enumerate |> tqdm

The first parameter is passed implicitly.

Also, let’s take a look at this example:
range(42) |> somefunc |> print, which is equivalent to:

for i in range(42):
    print(somefunc(i))

That’s clearly a for loop, so in this case, it cannot return a value:
a = range(42) |> somefunc |> print wouldn’t make sense.

Would PyFunctional (github.com/EntilZha/PyFunctional, a Python library for creating data pipelines with chain functional programming) be of use to you? While it doesn’t use pipes, it’s somewhat close to the Java Stream API and makes the code a bit cleaner (but I don’t know about its performance).

I would personally be happy to have this in the standard library, but I think this could first be done as a separate package where you create a wrapper object that implements its own __or__ method. This kind of feature will probably require a PEP because it’s non-trivial and affects the parser and its performance, though it would definitely help readability (you wouldn’t have to introduce a bunch of temporary variables). The pipeline operator is somewhat exposed in JavaScript using RxJS but is native in Elixir, Erlang, Haskell and probably others, so we have at least some precedent. However, those are functional languages in essence (or at least that’s how I would categorize them), which Python is not.
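
A minimal sketch of that wrapper-object approach, assuming | feeds the wrapped iterable into a one-argument callable (the class name Pipe is hypothetical):

class Pipe:
    def __init__(self, iterable):
        self.iterable = iterable

    def __or__(self, step):
        # Each step is any callable taking a single iterable.
        return Pipe(step(self.iterable))

    def __iter__(self):
        return iter(self.iterable)

# list(Pipe(range(10)) | (lambda it: filter(bool, it)) | enumerate)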

Note that brackets |> pairwise |> enumerate |> tqdm would be preferable because enumerate() would need to be detected as being part of the pipeline rather than a regular function call, so it would probably be more work on Python’s side (I assume the use of enumerate() with parentheses is because you’re taking the Elixir/Erlang syntax, where you refer to the functions in the pipeline like this).


That’s clearly a for loop, so in this case, it cannot return a value:

We could simply return None in this case, I think.

42 None values, to be exact. (Or it seems that it is becoming confusing!)

Mmh, yes, but that’s not really an issue, I think. We could have a |> /dev/null call at the end to suppress whatever is being returned (i.e., instead of a map(), consider it to be a forEach()). Or change the syntax, e.g., |@ print, which would just call print on each result without returning anything (and the entire pipe would just return a single None).

Yeah, something like that which wraps it in an object would be useful, especially as it makes it clear how the | operator should work.

There have been a few points I would like to clarify from this thread. I’m not married to any particular syntax, but I think I was ultimately after the currying syntax; I just wasn’t sure how we could make it backwards compatible with all the generators from itertools.

Ideally

[1,2] |> filter(deleted)

is the same as

filter(deleted, [1,2]) or partial(filter, deleted)([1,2])

Treating it as a curry operator keeps it simple, since it’s just another way of passing arguments in a pipeline-like manner without needing nesting or temp variables.

For this reason, in my example the generators we don’t configure still have empty parentheses, because otherwise we would need to figure out how to handle

This

list |> generator

And

list |> generator(args)

Keeping it the same with a () allows Python to treat both cases as a simple function call, just with an extra argument applied at the end, supplied by the return value of the thing on the left-hand side of the operator.

This also solves questions like what happens if they use print.

Something like this
bar = list |> filter(deleted) |> print

Would just print the generator object and return None, the same as it does with bar = print(filter(deleted, list)).

Instead, they would need to wrap it in an each, like so:
list |> filter(deleted) |> each(delete)
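
A sketch of what that hypothetical each factory could look like (the name and behavior are illustrative, inferred from the example above):

def each(action):
    def consume(iterable):
        # Apply the side-effecting action to every item; returns None.
        for item in iterable:
            action(item)
    return consume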

This would also unlock a nice way of saying you want it to be a list at the end:

|> flatten |> list

which would make sure we get a list rather than a generator, and I could see it maybe allowing for patterns such as trailing closure blocks one day.

1 Like

The problem is that oftentimes an iterable is not the only argument to a generator (such as windowed and tqdm), so the proposed syntax needs to accommodate additional arguments unless we are to clutter up the code with partial.

Making the piped iterable an implicit first argument to a call to a generator may be a necessary compromise to strike a balance between cleanliness and usefulness.

EDIT: One possible solution is to make the parentheses optional, such that the right operand is called with the piped iterable as the only argument if the right operand is callable:

brackets |> windowed(2) |> enumerate |> tqdm(unit="pairs")
1 Like

FWIW here’s one way to achieve a pipeline-like iterable-based transformation with the current syntax:

from itertools import batched

class pipeline:
    def __init__(self, iterable):
        self.iterable = iterable

    def __call__(self, generator, *args, **kwargs):
        # Feed the wrapped iterable to the callable as its first argument
        # and wrap the result in a new pipeline stage so calls can chain.
        return pipeline(generator(self.iterable, *args, **kwargs))

    def __iter__(self):
        return iter(self.iterable)

print(*pipeline('abcde')(batched, 2)(enumerate), sep='\n')

This outputs:

(0, ('a', 'b'))
(1, ('c', 'd'))
(2, ('e',))

One big issue is that many existing iterable helper functions take an iterable not as the first argument, but as the second. Examples include filter, map, reduce, starmap, takewhile, etc., and I don’t see a good way to allow specifying the position of the iterable while keeping the syntax clean, although one can always create a wrapper function that swaps the iterable argument into the first position.
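
As a sketch of that last point (the helper name iterable_first is made up):

def iterable_first(func):
    # Wrap func so it accepts the iterable first instead of last.
    def wrapped(iterable, *args, **kwargs):
        return func(*args, iterable, **kwargs)
    return wrapped

# pipeline('abcde')(iterable_first(map), str.upper) then works, since
# it calls map(str.upper, 'abcde') under the hood.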

2 Likes

You can use >> for shifting the positions. The syntax would be ugly, but it could help: (pipeline('abcde') >> 1)(second, 1) with def second(number, letter): .... On the other hand, you could say that [1] acts as the shift, so that you don’t need the extra () required by >>, e.g., pipeline('abcdef')[1](second, 1) (still ugly).

You could also add some .shift() method on pipeline objects, or a shift function, e.g., pipeline('abcdef')(shift, 1)(second, 1).

Interesting suggestions, but all of them look still too verbose and clunky to me.

Since there is almost no iterable helper function that takes an iterable as the third argument (the multiple-iterable form of map notwithstanding), a possibly more eye-pleasing syntax may be to use > to denote piping the iterable as the first argument and >> to denote piping as the second:

pipeline('abcde') > (batched, 2) >> (map, ''.join)

But abusing a tuple for a call spec means it doesn’t support keyword arguments, unless we take a tuple item as args and a dict item as kwargs:

pipeline('abcde') > (batched, (2,)) > (tqdm, {"unit": "pairs"})

But then it looks clunky again, so ultimately we still need a new dedicated syntax for the pipeline idea to work cleanly.

If you gave yourself a name for the pipeline output, it might work better. One of the fun things about inventing completely new syntax is that you can create all sorts of interesting things like special names.
e.g.

x: list[int] = range(100) |> filter(lambda x: x % 2 == 0, PIPE) |> list(PIPE)

where PIPE is a magic name for the preceding pipe’s output.

Even so, and much as I love pipes, I can’t see this fitting into the language. I don’t think the motivation is strong enough.


If people are interested in this thread, maybe look at the pipe library on PyPI.

3 Likes

I think I’ve found a simple way that we could provide a default implementation of this in something like functools or itertools.

It’s not quite a nice operator but it provides the same desired functionality without any breaking changes or new libraries.

def chainable(f):

    def action(self, *args, **kwargs):
        # Call the wrapped method for its side effect, then return self
        # so that calls can be chained.
        f(self, *args, **kwargs)
        return self

    return action


class Chain:

    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        # Return a real iterator; returning the raw iterable would break
        # the iterator protocol when a plain list is wrapped.
        return iter(self.iterable)

    @chainable
    def chain(self, func):
        self.iterable = func(self.iterable)


def chain(iterable):
    return Chain(iterable)


(chain([1, 2, 3, 4])
    .chain(lambda items: filter(lambda i: i < 3, items))
    .chain(lambda items: filter(lambda i: i > 1, items))
    .chain(lambda items: [print(i) for i in items]))

x = list(chain([1, 2, 3, 4])
         .chain(lambda items: filter(lambda i: i < 3, items))
         .chain(lambda items: filter(lambda i: i > 1, items)))
print(x)

I wrote another library, called pipeline_func, that might be of interest to those in this discussion. Here’s what it looks like in action:

>>> from pipeline_func import f, X
>>> a = [3, 4, 2, 1, 0]
>>> b = [2, 1, 4, 0, 3]
>>> a | f(zip, b) | f(map, min, X) | f(filter, bool, X) | f(sorted)
[1, 2, 2]

f is a class that wraps any arbitrary callable and implements the pipe syntax. X is an object that stands in for the output from the previous step, thereby accommodating functions like filter and map where the iterable is the second argument. By default, though, the first argument to the current step is the output from the previous step.

The X abstraction isn’t perfect; it doesn’t work if X is contained in another object (e.g. a list or a dict). To my knowledge, there’s no way to make an object that replaces itself with some other object the first time it’s accessed, which I think is what’s needed here. Maybe that would be a useful feature request for Python itself. (It would be possible to replace X in nested data structures by pickling/unpickling each argument, but this would be way too much overhead.)

I’m probably biased, but I think this is a pretty nice syntax already. Here’s an example of what it looks like in a real project, which I think is a little easier to understand than the contrived example above.

7 Likes

Nice syntax and library!

If, like in your example, most piped functions are called without any additional arguments:

x_conv = (
        x
        | f(self.conv1)
        | f(self.time, t)
        | f(self.bn1)
        | f(self.act1)
        | f(self.conv2)
        | f(self.bn2)
        | f(self.act2)
        | f(self.upsample)
)

I would probably offer an alternative syntax like this to avoid repeated calls to f:

x_conv = (
        pipe(x)
        | self.conv1
        | self.time << using(t)
        | self.bn1
        | self.act1
        | self.conv2
        | self.bn2
        | self.act2
        | self.upsample
)

There’s a similar, more established library called pipe.

I’m aware of the library, but I didn’t bring it into the conversation because it requires a wrapper to be defined for each callable to be pipeable.

1 Like

I agree with the desire of the original post, though for general objects and not just generators.
Having opening and closing brackets far apart is ugly, hurts readability, and is annoying to write. And being forced to use intermediate variables leads to mistakes.

As an additional advantage, having such notation available would reduce the need for subclassing (and wrappers).

Though I would note that

for index, bracket in brackets |> windowed(2) |> enumerate() |> tqdm():
    ...

looked confusing to me at first.

I think it’s preferable to demand the brackets, so that f |> g can be interpreted as partial(g, f). The code you end up with that way looks more similar to other Python code; specifically, it will look very similar to method chaining, which Python users are probably familiar with.
Then again, it isn’t completely clear to me whether f |> g |> h should mean partial(h, partial(g, f)) or partial(partial(h, g), f) within that paradigm. Or whether that would even matter.

This reminds me of something someone else wrote, which was an alternative way to define lambda functions. I can’t remember exactly which symbol they proposed, but essentially

filter(lambda x: x % 2 == 0, __) === lambda PIPE: filter(lambda x: x % 2 == 0, PIPE).

Would there be a good reason to use a context-specific keyword, rather than introducing a more general tool that can be used to quickly define partial-ish lambda functions anywhere?
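
To illustrate the general-tool idea, here is a minimal sketch of a placeholder that can be emulated in today’s Python (the names later and __ are made up for this example):

class _Placeholder:
    pass

__ = _Placeholder()

def later(func, *args, **kwargs):
    # Return a one-argument callable that substitutes its input
    # wherever __ appears among the bound arguments.
    def call(value):
        real_args = [value if a is __ else a for a in args]
        real_kwargs = {k: value if v is __ else v for k, v in kwargs.items()}
        return func(*real_args, **real_kwargs)
    return call

# later(filter, lambda x: x % 2 == 0, __) behaves like
# lambda PIPE: filter(lambda x: x % 2 == 0, PIPE)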

I don’t know whether your example or mine would work. It seems to me there is a risk that it won’t be clear where the function/expression that depends on PIPE ends. But if that problem can be fixed, then it can be fixed.

1 Like