Allow str.join to take *args in addition to iterable (like min/max)

I want to use str.join to construct multi-line strings directly, e.g.:

"\n".join(
    "line1",
    "line2",
)

but because only an iterable is accepted, I have to do it like this:

"\n".join(
    [
        "line1",
        "line2",
    ]
)

I don’t like the extra indentation, so I’ve resorted to using a helper function:

def lines(*strs):
    return "\n".join(strs)
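For completeness, here is the helper in use (definition repeated so the snippet runs on its own):

```python
def lines(*strs):
    # Join each positional argument with a newline; no trailing newline.
    return "\n".join(strs)

message = lines(
    "line1",
    "line2",
)
assert message == "line1\nline2"
```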

There’s precedent for this type of flexibility, such as the built-in min & max functions.

Has this been discussed before?
I found this question on Stack Overflow, but there’s no reference to a discussion among the Python devs.


Terry Davis said:

“I don’t like the extra indentation”

Then don’t use it. It’s not actually mandatory.

If you are constructing multi-line string literals, as shown in
your example, using join is inefficient. Why construct them at run time?
You can use a triple-quoted string:

value = """line 1
line 2
line 3
"""

or compile-time string literal concatenation:

value = ("line 1\n"
         "line 2\n"
         "line 3\n")

depending on your taste.


I’m using black for formatting, so I can’t omit indentation.

I want to construct them at run time because I’m lazy :grimacing:.

Compile-time literal string concatenation was my original approach.
The impetus for using str.join was to avoid having to add \n (or forgetting to…).
Triple quoted strings are either an eyesore:

def f():
    """line1
line2
line3"""

or have to be dedented with textwrap.dedent, which also doesn’t keep desired leading whitespace.
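For reference, the textwrap.dedent variant and the caveat mentioned above, as a runnable sketch:

```python
import textwrap

def f():
    # The backslash after the opening quotes suppresses an initial blank line.
    return textwrap.dedent("""\
        line1
        line2
    """)

assert f() == "line1\nline2\n"

# Caveat: dedent strips the *common* leading whitespace of all lines, so
# deliberate indentation shared by every line is stripped along with it:
assert textwrap.dedent("    a\n    b\n") == "a\nb\n"
```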

That’s an issue between you and black, I guess :slightly_smiling_face:. AIUI, the whole idea of black is that you’re not allowed to care about formatting, that’s black’s job (yes, that’s a joke, but there is a relevant point there).

Those hardly seem like sufficient reasons for a language change. Your helper function sounds like a fine solution - tailored to your preferences, easy to include in your projects, works in current versions of Python, so you don’t need to wait and upgrade.

I should also say, thanks for taking an interest in improving the language, and taking the time to do some research into the problem. But I don’t think this is likely to be sufficiently useful to get accepted.


If you would like to change join to accept a list of arguments, there are several other builtins that might benefit:

  • sum
  • list
  • tuple
  • set

I am not sure whether this list is exhaustive.

sum cannot be changed without breaking backwards compatibility. The
second positional argument is a starting value which is returned if
the first argument is empty:

>>> a = []
>>> b = sum([], a)
>>> b is a
True

If we changed sum to take an arbitrary number of positional arguments,
that would change the behaviour of the above to return []+a which is a
new list:

>>> [] + a is a
False

Considering that you can already construct a list, tuple, or set with a sequence of arguments using [a, b, c], (a, b, c), or {a, b, c} respectively, I don’t see much practical benefit in adding this to their builtin functions.

steven.daprano:

sum cannot be changed without breaking backwards compatibility

I don’t think that’s right.
In the current implementation, the first argument has to be an iterable. And the second is an optional start value.
If we extend the sum function in such a way:

  • if the first parameter is an iterable, either no extra parameter or just one (the start value) is allowed;
  • if the first parameter is not an iterable, all parameters are used in the sum. In that case a keyword argument is needed to specify the start value.

This looks backward compatible to me.

Examples:

  • sum(1): new functionality (currently raises a TypeError)
  • sum(1, 2): new functionality (currently raises a TypeError)
  • sum((1, 2)): works as currently implemented
  • sum((1, 2), 3): uses 3 as the start value, as currently implemented
  • sum(1, 2, 3, start=4): new functionality
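The dispatch described above could be sketched roughly like this (flexible_sum is a hypothetical name for illustration; a real change would live in the builtin itself):

```python
import builtins
from collections.abc import Iterable

def flexible_sum(*args, start=0):
    # If the first positional argument is an iterable, keep today's
    # behaviour: at most one extra positional argument, the start value.
    if args and isinstance(args[0], Iterable):
        if len(args) == 1:
            return builtins.sum(args[0], start)
        if len(args) == 2:
            return builtins.sum(args[0], args[1])
        raise TypeError("at most one start value after an iterable")
    # Otherwise, treat all positional arguments as the values to sum.
    return builtins.sum(args, start)

assert flexible_sum((1, 2)) == 3
assert flexible_sum((1, 2), 3) == 6
assert flexible_sum(1, 2, 3, start=4) == 10
```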

As a side note, I would like sum to support strings. It is unclear why this is not supported: sum(('a', 'b', 'c'), '').
I know it is less performant than join, but still…
Python also doesn’t refuse a = 'b' + 'c' for performance reasons!

Precisely because of the performance reason. String concatenation using addition is quadratic in the number of parts, whereas join is linear. That’s a significant issue, and using addition on large numbers of strings is a known anti-pattern. Allowing sum on strings was considered enough of an attractive nuisance that it should be explicitly blocked. From my recollection, that decision was made by Guido himself.

Having the “obvious” way to do something be significantly worse than an alternative, less-obvious, way is very much contrary to Python’s philosophy.
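To illustrate the asymptotic difference being described (a sketch; the loop is the anti-pattern, not a recommendation, and CPython may partially optimize it in some cases):

```python
parts = ["abc"] * 1000

# Anti-pattern: in general each step copies the whole accumulated
# string, so total work grows quadratically with the number of parts.
slow = ""
for part in parts:
    slow = slow + part

# join computes the final length once and copies each part exactly once.
fast = "".join(parts)

assert slow == fast   # same value, very different cost at scale
```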

Wouldn’t it have been more logical to switch automatically to join-like functionality (and performance) once a string is detected as the start value? That is exactly the only case in which sum is refused now.
The following code does exactly that:

import builtins

def sum(iterable, start=0):
    if isinstance(start, str):
        return start + "".join(iterable)
    # Fall back to the original builtin; calling the bare name here
    # would recurse into this shadowing definition.
    return builtins.sum(iterable, start)

That would make the language more consistent, IMHO.

And it is still backward compatible. Something for 3.9?

What is the benefit? For fixed number of values there is already existing syntax.

  • Instead of hypothetical sum(x, y, z) you can use x + y + z.
  • Instead of hypothetical list(x, y, z) you can use [x, y, z].
  • Instead of hypothetical tuple(x, y, z) you can use (x, y, z).
  • Instead of hypothetical set(x, y, z) you can use {x, y, z}.

This has all been discussed previously. I suggest you check the python-dev archives for the discussions when sum() was first introduced. The decision to block strings was deliberate at the time, and as far as I am aware, none of the factors resulting in that decision have changed much since then.

You’re welcome to disagree with the conclusions, but if you want the function to be changed, you’ll have to persuade the core devs, which will involve addressing the factors raised then and explaining why things are different now.

@pf_moore
Could you give a reference to the place(s) where this issue was discussed among the Python devs?
Then I can study the background of this decision.

Thanks.

Sorry, no I don’t have one. You should be able to find the discussions using google against the python-dev archives.

@pf_moore
Found the discussion. There are more people who share my reasoning, but it has been decided differently. So be it.

In reference to the performance issue, it should be pointed out just how
bad the performance of sum() on strings can be. Really, really bad.

To demonstrate this, we need a simple class that can fool the sum
function into allowing strings, and some timing code:

class ForceString:
    # We need to trick sum into adding strings.
    def __add__(self, other):
        return other

x = ForceString()
# Note x must be the left operand: ForceString defines only __add__,
# so "a" + x would raise a TypeError.
assert x + "a" == "a"

from timeit import Timer
setup = 'from __main__ import strings, x'
joinT = Timer('"".join(strings)', setup=setup)
sumT = Timer('sum(strings, x)', setup=setup)

For a small number of strings, sum isn’t too bad, only about 14 times
slower than join, give or take a bit:

# Tested on Python 3.8
>>> strings = ['abc']*100
>>> print('Join:', min(joinT.repeat(number=1000, repeat=5)))
Join: 0.01821363903582096
>>> print('Sum:', min(sumT.repeat(number=1000, repeat=5)))
Sum: 0.23563178814947605

But as the number of strings increases, the cost of sum increases even
faster. Increase the number of strings by a factor of 100, and sum is
1000 times slower:

>>> strings = ['abc']*10000
>>> print('Join:', min(joinT.repeat(number=1000, repeat=5)))
Join: 1.5848643388599157
>>> print('Sum:', min(sumT.repeat(number=1000, repeat=5)))
Sum: 1593.3930510450155

Increase the number of strings by another factor of 10, and sum is
around 8000 times slower:

>>> strings = ['abc']*100000
>>> print('Join:', min(joinT.repeat(number=1000, repeat=5)))
Join: 16.620423825457692
>>> print('Sum:', sumT.repeat(number=1, repeat=1)[0])
Sum: 135.0639129653573

(The raw numbers there need some care in interpretation: the join
version was run 1000 times for a total time of 16 seconds; the sum
version was run once for a time of 135 seconds. A faster computer will
help with the wall clock timings, but not the relative timings.)

Now it’s clear that the performance of sum is not precisely quadratic,
but it’s much worse than linear. To be honest, I don’t understand why
the performance isn’t quadratic: from theoretical reasoning, the final
example should be 14 million times slower than join, not a measly 8000
times slower :slight_smile:

You should note also that sum will perform as poorly, or worse, when
summing anything where + means concatenation, such as lists or tuples.

The conclusion we drew from this many years ago was to discourage people
from using sum() for concatenation. In practice, that means that summing
strings is the trap. Nobody is likely to accumulate a list of a billion
tuples and then try to flatten them into a single tuple with sum, but
people are going to try to concatenate a billion strings.

Hence sum() intentionally prevents the user from summing strings, but
doesn’t bother trying to prevent summing tuples, lists etc.

In a sense, this was a compromise between those who wanted the right to
shoot themselves in the foot with really slow repeated concatenation,
and those who wanted to protect the coder from accidentally writing
really slow code through ignorance. (“Performance was fine in testing,
but in production, it would sometimes drop to a crawl.”)

IIUC, the reason for the performance not being quadratic is due to a C-level optimization that occurs under the hood for string concatenation, in the function unicode_concatenate(). Specifically, if the string on the left side of the addition operation is no longer needed, it gets overwritten into the result of the concatenation (instead of allocating a new string).

Note: the above applies when using successive += to concatenate strings, but I’m not 100% certain that it applies to @steven.daprano’s example.
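A minimal illustration of the pattern that benefits from this optimization (CPython-specific; PEP 8 explicitly warns against relying on it across implementations):

```python
# CPython-specific: when the string bound to `s` has no other
# references at the point of the +=, the interpreter may resize it in
# place instead of copying, keeping this loop roughly linear. Other
# implementations make no such promise; str.join is always linear.
parts = ["abc"] * 100000

s = ""
for part in parts:
    s += part

assert s == "abc" * 100000
```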

@aeros @steven.daprano
This benchmark still uses repeated __add__ calls, as in the current sum implementation.
But, why shouldn’t sum work differently when the start value is a string?
Like I suggested before:

import builtins

def sum(iterable, start=0):
    if isinstance(start, str):
        return start + "".join(iterable)
    # Delegate to the original builtin to avoid infinite recursion.
    return builtins.sum(iterable, start)

I still haven’t seen anywhere what the disadvantage of this is.
And it performs as join, of course.

IIUC, the reason for the performance not being quadratic is due to a C-level optimization that occurs under the hood for string concatenation, in the function unicode_concatenate() .

I don’t think that’s the case; this shortcut is part of the main interpreter loop, and is only used for certain addition operations (+= or + between strings where, as you noted, the left operand is about to be tossed) in the Python layer. It works because we have such a great understanding of the current execution context.

The sum and str.join builtins work entirely in the C layer. As far as I know, the str implementation does have a lot of fine-tuned code, but nothing like this reference-counting-locals-inspecting sorcery!

I still haven’t seen anywhere what the disadvantage of this is.

See the related prior discussion at https://bugs.python.org/issue18305. Specifically, the final comment.

It’s not that we don’t have the technology to do this efficiently. It’s just that sum is a tool specialized to do one thing, and that one thing is very well understood by all users. Changing the implementation for sequence concatenation breaks that model.

Rather than special-casing sum in CPython, consider rewriting the code as a Python loop so that you can take advantage of the optimization that @aeros mentioned above! Or even better, just using str.join explicitly, like the error message suggests :wink:.
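To make that last point concrete, here is the refusal in action alongside the recommended spelling (the exact wording of the message is a CPython detail):

```python
values = ["a", "b", "c"]

# sum() explicitly blocks string concatenation, and its error message
# points the user at join.
try:
    sum(values, "")
except TypeError as exc:
    message = str(exc)

assert "join" in message

# The recommended, linear-time spelling:
assert "".join(values) == "abc"
```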