Introduce funnel operator, i.e. '|>', to allow for generator pipelines

Agreed, this is in no way an acceptable final result of such an endeavour, but it can be a fairly good starting point.

This is what I have managed to pull together:

_ = Placeholder
pipeline = partial(opr.add, 1) -C- partial(opr.sub, _, 1)
pipeline(2)                                          # 2
2 |A| partial(opr.add, 1) -C- partial(opr.sub, _, 1) # 2
[1, 2] |AS| opr.sub -C- partial(opr.mul, 2)          # -2
[11, 3] |AS| divmod -CS- opr.mul                     # 6
[1, 2] |AS| opr.sub -C- split([opr.pos, opr.neg]) -C- sum  # 0
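For reference, custom infix operators like the -C- compose operator above can be built with the classic Python "infix wrapper" trick, abusing operator overloading on a helper object. A minimal sketch under that assumption (the names C, add1, and sub1 are illustrative, not the actual implementation behind the demo, and this version avoids the 3.14-only functools.Placeholder):

```python
from functools import partial

class Infix:
    """Wrap a binary function so it can be spelled as `left -OP- right`."""
    def __init__(self, func):
        self.func = func

    def __rsub__(self, left):          # handles `left - OP`
        return Infix(partial(self.func, left))

    def __sub__(self, right):          # handles `OP - right`
        return self.func(right)

# -C- composes two functions left-to-right
C = Infix(lambda f, g: lambda *a, **kw: g(f(*a, **kw)))

add1 = lambda x: x + 1
sub1 = lambda x: x - 1

pipeline = add1 -C- sub1   # parses as (add1 - C) - sub1
assert pipeline(2) == 2
```

Because functions don't define __sub__, `add1 - C` falls through to C.__rsub__, which is what makes the spelling work.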

This is working code.
Design and functionality improvements can now be addressed separately:

  1. partial improvements to specify positional order of inputs
  2. partial at parser level
  3. pipe implementation
  4. other useful utilities
  5. more convenient operators

As a demonstration, I've made my pipe compose functions when it has not yet been given an object:

_NOTSET = object()

class Pipe:
    def __init__(self, obj=_NOTSET):
        self.obj = obj
        self.funcs = []

    def __or__(self, func):
        if self.obj is _NOTSET:
            if not self.funcs:
                self = Pipe()
            self.funcs.append(func)
            return self
        return Pipe(func(self.obj))

    def __ror__(self, obj):
        if self.funcs:
            for func in self.funcs:
                obj = func(obj)
            return obj
        return Pipe(obj)

    __call__ = __ror__

pipe = Pipe()
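To make the behaviour concrete, here is a self-contained run of the class above (restated verbatim) exercising both the deferred-composition mode and the immediate mode; the step functions are arbitrary stand-ins:

```python
_NOTSET = object()

class Pipe:
    def __init__(self, obj=_NOTSET):
        self.obj = obj
        self.funcs = []

    def __or__(self, func):
        if self.obj is _NOTSET:
            if not self.funcs:
                self = Pipe()   # start a fresh pipe; keeps the global `pipe` reusable
            self.funcs.append(func)
            return self
        return Pipe(func(self.obj))

    def __ror__(self, obj):
        if self.funcs:
            for func in self.funcs:
                obj = func(obj)
            return obj
        return Pipe(obj)

    __call__ = __ror__

pipe = Pipe()

# deferred composition, then called two ways
inc_str = pipe | (lambda x: x + 1) | str
assert inc_str(41) == '42'
assert (41 | inc_str) == '42'

# immediate mode: once an object is piped in, each `|` applies the
# function right away and rewraps the result in a Pipe
assert ((2 | pipe) | (lambda x: x * 3)).obj == 6
```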

With the using class I suggested in this post, it can both perform immediate calls and compose functions for a later call:

'abcde' | pipe | batched << using(2) | map >> using(''.join) | list | print
# outputs ['ab', 'cd', 'e']

pairs = pipe | batched << using(2) | map >> using(''.join) | list | print

'abcde' | pairs
# outputs ['ab', 'cd', 'e']

pairs('abcde')
# outputs ['ab', 'cd', 'e']

Demo here

2 Likes

As a comment on readability, I'll point out that even though I'm familiar with languages that have a pipe operator, and I've been reading this thread, I have no idea what that pipeline function is intended to do (specifically, what the point of the initial lambda is).

If I saw something like that in code I was reviewing, I'd ask for it to be rewritten so the intent was clearer. Which suggests that the goal of more readable pipelines is not being achieved…

2 Likes

I like the using idea.

But I would use it with Placeholder, for a simpler implementation and to avoid needing two operators. This way, one can just use the mental model of partial.

Also, to keep Pipe simple, one can use brackets.

from itertools import batched
from functools import partial, Placeholder as _


class using:
    def __init__(self, *args, **kwds):
        self.args = args
        self.kwds = kwds

    def __call__(self, func):
        return partial(func, *self.args, **self.kwds)

    __rlshift__ = __call__


class Pipe:
    def __init__(self, *funcs):
        self.funcs = funcs

    def __or__(self, func):
        funcs = self.funcs + (func,)
        return type(self)(*funcs)

    def __ror__(self, obj):
        for func in self.funcs:
            obj = func(obj)
        return obj

    __call__ = __ror__


'abcde' | (Pipe() |
           batched << using(_, 2) |
           map << using(''.join) |
           list |
           print)
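A quick self-contained check of the same using/Pipe combination (classes restated here), avoiding batched and the 3.14-only functools.Placeholder so it also runs on older Pythons; the step functions str.upper and ord are arbitrary stand-ins for demonstration:

```python
from functools import partial

class using:
    def __init__(self, *args, **kwds):
        self.args = args
        self.kwds = kwds

    def __call__(self, func):
        return partial(func, *self.args, **self.kwds)

    __rlshift__ = __call__

class Pipe:
    def __init__(self, *funcs):
        self.funcs = funcs

    def __or__(self, func):
        return type(self)(*self.funcs, func)

    def __ror__(self, obj):
        for func in self.funcs:
            obj = func(obj)
        return obj

    __call__ = __ror__

# `<<` binds tighter than `|`, so each `using` attaches to its function first
result = 'abcde' | (Pipe()
                    | str.upper
                    | (map << using(ord))   # builds partial(map, ord)
                    | list)
assert result == [65, 66, 67, 68, 69]
```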
1 Like

I was aiming to make the usage involve as few symbols as possible but yes, your version would make the implementation simpler.

I think this would cover 90% of cases.

What about this?

from operator import add

def final_func(a, b, *, c):
    return (a - b) / c

obj = 1
a = add(obj, 1)
b = add(obj, 4)
c = range(a, b)
d = list(c)
e = final_func(d[2], d[0], c=d[1])
print(e)    # 0.6666666

pipeline = ?
print(pipeline(1))    # 0.6666666
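For what it's worth, as an ordinary function the computation is easy to write; the difficulty this example points at is that the fan-out to a and b and the re-indexing of d form a DAG rather than a chain, which is exactly what a linear pipeline struggles to express. A plain-function answer (not a pipe) for comparison:

```python
def final_func(a, b, *, c):
    return (a - b) / c

def pipeline(obj):
    # a = obj + 1, b = obj + 4, d = list(range(a, b))
    d = list(range(obj + 1, obj + 4))
    return final_func(d[2], d[0], c=d[1])

assert abs(pipeline(1) - 2 / 3) < 1e-9   # 0.6666...
```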

I also have reservations about readability. You remove some brackets, but that's not much of a win when I can't figure out what about 80% of the examples proposed on this page are supposed to do. And for all of the few that I can understand (or that helpfully have the output included in the example), I'm thinking: wow, that would have been so much clearer as a comprehension loop!

The analogy to UNIX shells raises more questions than it answers. A shell pipe can only receive one input, and it's up to the receiving executable to decide whether stdin.read() is one object, a newline/null-delimited array to loop over, or an xargs-style sequence of arguments. That concept doesn't transfer to a language where functions take arbitrary arguments and keyword arguments.

FWIW, I should mention that a pipeline operator should be general, i.e. not limited to generators. For example, I have a few microservices that pass around different upload IDs (UUIDs). If there were a pipe operator, I could use it like this:

path_to_image
|> upload  # returns uploadID
|> remove_spots(~, max_area=30)  # returns id after spot removal
|> convert_to_png  # returns id of the png image
|> ocr # returns text
|> retrieve(query, ~)  # returns relevant results

Now granted, I can do this with reduce:

reduce(
    lambda x, f: f(x),
    [
        upload,
        partial(remove_spots, max_area=30),
        convert_to_png,
        ocr,
        lambda text: retrieve(query, text),
    ],
    path_to_image,
)
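The reduce version does run as advertised; here is a self-contained check with hypothetical stub services standing in for the real microservices (the string-returning stubs are assumptions purely for demonstration):

```python
from functools import reduce, partial

# hypothetical stand-ins for the microservice calls
def upload(path):                    return f"id({path})"
def remove_spots(uid, *, max_area):  return f"spotless{max_area}({uid})"
def convert_to_png(uid):             return f"png({uid})"
def ocr(uid):                        return f"text({uid})"
def retrieve(query, text):           return f"{query}: {text}"

query = "q"
result = reduce(
    lambda x, f: f(x),        # thread the value through each stage
    [
        upload,
        partial(remove_spots, max_area=30),
        convert_to_png,
        ocr,
        lambda text: retrieve(query, text),
    ],
    "img.jpg",
)
assert result == "q: text(png(spotless30(id(img.jpg))))"
```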

With Placeholder:

_ = Placeholder
reduce(
    lambda x, f: f(x),
    [
        upload,
        partial(remove_spots, max_area=30),
        convert_to_png,
        ocr,
        partial(retrieve, query, _),
    ],
    path_to_image,
)

I find the pipeline operator to be much more elegant and readable.

Of course, there is the other version, which is the most readable of all:

upload_id = upload(path_to_image)
spotless_id = remove_spots(upload_id, max_area=30)
png_id = convert_to_png(spotless_id)
text_results = ocr(png_id)
answer = retrieve(query, text_results) 

But note, here I am having to come up with names for variables that are throwaway in this case. The other option is:

_ = upload(path_to_image)
_ = remove_spots(_, max_area=30)
_ = convert_to_png(_)
_ = ocr(_)
answer = retrieve(query, _) 

Now _ = does behave like a poor man's pipeline operator, but when I look at this code, it makes me sad.

1 Like

If that is the issue, then I don't think there's anything to solve. First, the current function chaining isn't what you're referring to, and second, if you're converting a generator to a list, you're misusing generators.

I'm not understanding the other posts: what exactly is the problem being solved?

1. I remember someone earlier in this thread mentioning that if a callable is chained in the pipeline, it will be automatically called, and a generator is expected to be returned.

I think this idea can be replaced by non-generator functions that handle one item at a time:

generator = range(3) | str
# Equivalents to `map(str, range(3))`

list(generator) # ["0", "1", "2"]

# or even:

generator = range(3) | (lambda x: x + 1) | print

list(generator) # [None, None, None]

# Side effect - prints:
# 1
# 2
# 3
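For comparison, the proposed semantics of both snippets is exactly what nested map gives today; this just restates the "Equivalents to" comments above in runnable form:

```python
# first example: lazy elementwise conversion
gen = map(str, range(3))
assert list(gen) == ["0", "1", "2"]

# second example: map the lambda, then map print over the results
gen = map(print, map(lambda x: x + 1, range(3)))
assert list(gen) == [None, None, None]   # prints 1, 2, 3 as a side effect
```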

The reason behind this is that if a function returns a generator, it very likely needs additional arguments besides the pipelined values. For example:

def add(b):
    a = yield
    while True:
        a = yield a + b

# Usage
range(3) | add(10) # 10, 11, 12

Hence it might be too wasteful to automatically call a callable just to save a pair of empty parentheses. Not to mention that a callable object can also be an iterable (supporting __iter__) and a generator (supporting send) at the same time.
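The send-based add example can already be driven explicitly with a small helper, which is roughly what the proposed operator would have to do under the hood for generator stages (pump is an assumed name, not part of any proposal):

```python
def add(b):
    a = yield
    while True:
        a = yield a + b

def pump(iterable, gen):
    next(gen)                        # prime the generator up to the first yield
    return [gen.send(x) for x in iterable]

assert pump(range(3), add(10)) == [10, 11, 12]
```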

2. Also, as pointed out in multiple posts, the funnel operator will return a generator; nothing will be executed until the generator is (later) iterated.

Solutions have been proposed: chain a list, set, or dict whenever the pipeline is supposed to drain itself immediately. However, this semantic could also be used to convert each item into the corresponding type (e.g. enumerate(range(3)) | list should return a generator that generates [0, 0], [1, 1], [2, 2] instead of immediately draining the pipeline and returning [(0, 0), (1, 1), (2, 2)]). Therefore, a "finalizer" helper might be helpful:

range(3) | str | finalize(list) # Returns ["0", "1", "2"]
# Equivalents to `list(map(str, range(3)))`
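As a toy model of these semantics (elementwise | plus a draining finalize), one can sketch a wrapper today; the piped iterable has to be wrapped explicitly, since | on a range object can't be overloaded, and that gap is precisely what the proposal targets. All names here are illustrative assumptions:

```python
class Finalize:
    def __init__(self, func):
        self.func = func

def finalize(func):
    return Finalize(func)

class P:
    """Toy pipeline: `| f` maps f lazily over the stream,
    `| finalize(f)` applies f to the whole stream and drains it."""
    def __init__(self, it):
        self.it = it

    def __or__(self, step):
        if isinstance(step, Finalize):
            return step.func(self.it)
        return P(step(x) for x in self.it)

assert (P(range(3)) | str | finalize(list)) == ["0", "1", "2"]
# elementwise `list` converts each item; `finalize(list)` drains
assert (P(enumerate(range(3))) | list | finalize(list)) == [[0, 0], [1, 1], [2, 2]]
```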

I think there needs to be a clear separation between "iterator piping" and "function composition".

And leave such experimentations for 3rd party libraries. At least for now.

4 Likes

Great example, but I think the problem is that some people, looking at the various syntaxes proposed so far, find them confusing because there's no clear indication that a pipeline pattern is about to follow the object being piped.

To improve clarity, I think we can introduce a dedicated statement with a new keyword, so that its body consists unmistakably of call specifications rather than expressions. Something like:

pipe path_to_image:
    |> upload  # returns uploadID
    |> remove_spots(_, max_area=30)  # returns id after spot removal
    |> convert_to_png  # returns id of the png image
    |> ocr # returns text
    |> retrieve(query, _)  # returns relevant results
    => result # assigns the final return value to result

And there's precedent in the match-case statement, in which Point(x, y) is not treated as a call to Point with arguments x and y but rather as a specification of a match pattern, and where _ has special meaning.

Furthermore, following the logic of partial, we need a _/Placeholder in the call specification only if the piped object isn't the last positional argument. As a toy example for easier illustration:

pipe 'abcde':
    |> batched(_, 2) # or batched(n=2) to avoid using a placeholder
    |> map(''.join) # no need for _ because the piped object follows ''.join
    |> list
    => paired # paired becomes ['ab', 'cd', 'e']
    |> print # possible to continue piping after an intermediate assignment
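Desugared into today's Python, the toy pipe block above would behave roughly like the following (using a simplified stand-in for itertools.batched so the sketch also runs before 3.12):

```python
from itertools import islice

def batched(iterable, n):
    # simplified stand-in for itertools.batched (Python 3.12+)
    it = iter(iterable)
    while chunk := tuple(islice(it, n)):
        yield chunk

_ = 'abcde'
_ = batched(_, 2)
_ = map(''.join, _)
paired = list(_)         # the `=> paired` intermediate assignment
assert paired == ['ab', 'cd', 'e']
print(paired)            # the trailing |> print step
```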
1 Like

One disadvantage of scope syntax is that it prevents us from using it in a lambda. For me that's okay; stylistically that should be avoided anyway. But the advantage (more on that later) is that we don't have to wrap it in parentheses to make it multiline, and a formatter can ensure that when it is in a pipe block.

But I am starting to feel an explicit call and placeholder should be mandated. For example, in map(''.join), although it is clear to me that it functions like a partial, map(''.join, _) may be more explicit. Which also brings it closer to structural pattern matching. And typing 3 or 4 extra characters is a small price to pay. Similarly, I am hesitant about the intermediate assignment syntax; even here we could repurpose as, just like match-case:

pipe 'abcde':
    |> batched(_, 2)
    |> map(''.join, _)
    |> list(_) as paired
    |> print(_)  # or print(paired)

Where paired could be used nominally in subsequent functions as an argument if needed. And when I use _ as a placeholder here, I don't mean _ = functools.Placeholder but a soft keyword (I think that is what you were suggesting).

Although => as the final assignment looks pretty, there isn't much being added there to warrant a new token.

Some questions that come to mind:

  1. Is it okay to unpack _? e.g. |> print(*_)
  2. Is it okay to use it in an f-string? i.e.
    |> print(f"paired = {_}") (if we hadn't done an intermediate assignment)
  3. Can the scope block be made atomic if needed? e.g. atomic pipe 'abcde'
  4. Can the pipeline itself be assigned? e.g.
pipe as pipeline:
    |> batched(_, 2)
    |> map(''.join, _) as paired # what does intermediate assignment mean here?
    |> list(_) # return type inferred from here

pipeline(iterable1)
pipeline(iterable2)

The first two seem fine to me. I am not sure about the feasibility of the 3rd.
I am not convinced about 4 myself. I would prefer writing a wrapper function if I want to reuse a pipeline.

Speaking of advantages, as a consequence of using pipe as a block, we don't need ~. And if needed, |> could be dropped as well; any of |, >, >>, -> could be used to represent the pipeline operator inside a pipe block.

pipe 'abcde':
    >> batched(_, 2)
    >> map(''.join, _)
    >> list(_) as paired
    >> print(_)

EDIT: scratch the atomic idea; I don't think it is possible. There could be arbitrary functions with side effects.

1 Like

With the current toolkit one can implement:

path_to_image | (pipe
    >> upload
    >> remove_spots@sub(_, max_area=30)
    >> convert_to_png
    >> ocr
    >> retrieve@sub(query, _)
) == result

The == operator cannot perform an assignment.

You are right; then it would have to be this:

result = path_to_image | (pipe
    >> upload
    >> remove_spots@sub(_, max_area=30)
    >> convert_to_png
    >> ocr
    >> retrieve@sub(query, _)
)

I'd rather use partial(remove_spots, _, max_area=30) instead of this trick. And TBH, I have quite a few codebases where I use this kind of pattern. But having to use object | (pipe | ...) or object | (pipe >> ...) leaves a bad taste. It works quite well when I have some object of my own, where I have overloaded __or__/__rshift__ and their siblings, to begin with. That is why having an operator/some mechanism that is applicable to naive objects seems so useful.

Furthermore, @blhsing's version has a nice side effect of intermediate assignments, which could be useful once in a while.

Yeah I believe mixing a pipeline with other expressions can easily make the code unreadable when the pipeline has different grammar rules.

I like the as idea to avoid spending a line just to name an intermediate value.

Agreed about mandating a placeholder when there are other arguments in the specification, but I think we can still allow the simplest use case to be written with a bare callable like list and print above, because it would remain unambiguous where the piped object is placed when it is the only argument. But then, when there's an as clause, I do think a placeholder should be mandated, to avoid list as paired looking as if we're assigning list to paired.

Good idea about making intermediate variables immediately reusable for subsequent calls, as they should be, with a subsequent call specification evaluated only after its preceding call returns.

In fact, _ can also simply be implemented as an intermediate variable storing the last returned value.

So yes, if _ is simply a normal variable storing the last returned value, then it can be used in any expression like the two above.

I don't see a need for a separate scope. The pipe statement should be more like match-case rather than def and class.

I agree. An alternative syntax that defines a pipeline function would be convenient but may make the statement too confusing as performing immediate calls within the current scope and defining a function with its own scope are two distinctly different operations.

Agreed that we don't need to introduce new tokens within a dedicated pipe block. I still like | slightly better because I associate | with a pipe, and if bare callables are allowed in the simplest case as I suggested, the above would become:

pipe 'abcde':
    | batched(_, 2)
    | map(''.join, _)
    | list(_) as paired
    | print
1 Like

I am a fan of this idea! In fact, I created an account to propose almost the same thing. I would really like to see @ (__matmul__) used for this. I think it would fit into the language quite nicely, given that the symbol is already used for decorators. Although I have no idea how prevalent the use of matrix multiplication is; anecdotally, I've used it maybe 4 times in 6 years. I also think that | is already used extensively in the language: standard bitwise or, merging dicts/mappings, and type unions. I think it would be good to distinguish this with different syntax, although | does remind me of piping and makes sense in that way.

I also really hope that unpacking is built into it as well, so that something like:
(g @ f)(x) is equivalent to g(f(x))
(g *@ f)(x) is equivalent to g(*f(x))
(g **@ f)(x) is equivalent to g(**f(x))

and in my dreams this would also be true
(g ***@ f)(x) is equivalent to g(*y[0], **y[1]) # where y = f(x)
but that's a whole other topic, I think.
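The three proposed flavours desugar to ordinary higher-order helpers; here is a sketch of what g @ f, g *@ f, and g **@ f would mean as plain functions (the helper names are made up for illustration):

```python
def compose(g, f):        # proposed g @ f
    return lambda x: g(f(x))

def compose_star(g, f):   # proposed g *@ f
    return lambda x: g(*f(x))

def compose_dstar(g, f):  # proposed g **@ f
    return lambda x: g(**f(x))

assert compose(str, len)('abc') == '3'
assert compose_star(divmod, lambda x: (x, 3))(7) == (2, 1)
assert compose_dstar(dict, lambda x: {'n': x})(5) == {'n': 5}
```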

Regardless, I'm quite excited about the prospect of this. I'll have to re-read the whole thing more carefully when I have some more time.

edit: add brackets and inputs to the compose statements.

1 Like

+1 to using @ for function composition.
But I think that's a different topic.
(f @ g @ h)(x) :== f(g(h(x)))
Whereas this thread discussed the syntax
x |> f() |> g() |> h() :== h(g(f(x))) (and variations thereof)

1 Like