Allow a string as start for sum

ruud · January 27, 2020, 12:18pm

Recently I was participating in a thread where, as a side topic, the sum function was discussed.

I was wondering (again) why sum does not allow a string as the start value of sum.
Obviously, the current implementation raises a TypeError with a suggestion to use join instead.
I have looked at the discussion here Issue 18305: [patch] Fast sum() for non-numbers - Python tracker and find there, of course, the performance issue. But what if we just delegated sum to join when the start value is a string?
The only problem might be then if we use a class that is inherited from str and that class has overridden the __add__ method. Well, that can be easily solved by checking whether this start variable uses the str.__add__ method or not.
So, in pseudocode, I propose:

def sum(iterable, start=0):
    # Compare on the type: a bound method such as "".__add__ never equals str.__add__.
    if isinstance(start, str) and type(start).__add__ is str.__add__:
        return start + "".join(iterable)
    return orgsum(iterable, start)  # orgsum is like the current sum, minus the TypeError for strs.

If we implemented sum like that, we wouldn’t need a TypeError when a string is used as the start value, and there would be no need for a performance warning, because it would perform as quickly as join.
I think, in all the discussions, nobody has ever come up with this implementation.
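
A minimal runnable sketch of the pseudocode above, for experimenting with the idea. The name proposed_sum and the Shout class are hypothetical, and functools.reduce with operator.add stands in for "orgsum" as an assumption about what plain repeated addition would do:

```python
import functools
import operator


def proposed_sum(iterable, start=0):
    # Hypothetical helper. The check is made on the type, because a bound
    # method such as "".__add__ never compares equal to str.__add__.
    if isinstance(start, str) and type(start).__add__ is str.__add__:
        return start + "".join(iterable)
    # Assumed stand-in for "orgsum": repeated addition, left to right.
    return functools.reduce(operator.add, iterable, start)


class Shout(str):
    # Hypothetical str subclass with a custom __add__.
    def __add__(self, other):
        return Shout(str.__add__(self, other).upper())


print(proposed_sum(["a", "b", "c"], ">"))    # join path: >abc
print(proposed_sum([1, 2, 3], 10))           # repeated addition: 16
print(proposed_sum(["a", "b"], Shout("x")))  # custom __add__ is used: XAB
```

Only the type of start is inspected here; the items of the iterable are never checked.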

Therefore, I would like to propose a PEP for this. Is there a core developer willing and able to help me with that?

ericvsmith · January 27, 2020, 12:44pm

Say I had some code that used this version of sum() to concatenate strings, and suppose it took as a parameter the starting value. If someone called it with a str then it would work. But if the caller switched to a str subclass with a custom __add__, then it would raise a TypeError. That doesn’t seem like a great design, when instead it could use str.join() for the concatenation and work in both cases.

And what if the starting value were a str, but the values being summed were str subclasses with a custom __add__? They would be concatenated using join(), which doesn’t seem right. It seems you’d need to look at every item’s __add__ method, which in general is not possible.

I’d be more convinced if there was an algorithm that “summed” a bunch of things, and included a start value as input, and the same algorithm could operate on either numbers or strings, where adding the numbers made as much sense as concatenating the strings. But I’d be hard pressed to imagine such a thing.

So, I’m -1.

pf_moore · January 27, 2020, 1:07pm

For me, this is a good example of “explicit is better than implicit”. If you want join, use it. If you want sum (which I can’t imagine having any meaning other than “repeated addition”) then use it.

In practice, no-one should ever want to use repeated addition on strings, because of the performance issues, so not providing the “convenience” shortcut of sum, but requiring users who really do want repeated addition on strings to explicitly code it as a loop, seems reasonable.

It is of course always arguable whether preventing people from doing something “for their own good” is reasonable. But IMO this is not so much about limiting writers of code as it is about making the intent explicit for readers of the code. (And as we all know, “readability counts”.)
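
As an illustration of the two explicit spellings contrasted above, here is a small sketch. The function name concat_by_loop is hypothetical, not something from the standard library:

```python
def concat_by_loop(strings, start=""):
    # Explicit repeated addition: every step goes through the + operator,
    # so any custom __add__ / __radd__ on the operands is respected.
    result = start
    for piece in strings:
        result = result + piece
    return result


words = ["spam", "ham", "eggs"]

looped = concat_by_loop(words)  # repeated addition, spelled out
joined = "".join(words)         # explicit join with an empty separator

print(looped)  # spamhameggs
print(joined)  # spamhameggs
```

For plain str values both give the same text; the difference is what the reader is told about how the result is built.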

ruud · January 27, 2020, 1:47pm

@ericvsmith
I think I was not clear enough in my pseudo code.

If I understand it right, sum now is more or less equivalent to

def sum(iterable, start=0):
    if isinstance(start, str):
        raise TypeError("sum() can't sum strings [use ''.join(seq) instead]")
    for item in iterable:
        start = start + item
    return start
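
For reference, the behaviour of the real built-in can be checked directly. This is only an observation of CPython's behaviour, not a description of how it is implemented internally:

```python
pieces = ["a", "b", "c"]

# A str start value is rejected up front by the built-in.
try:
    sum(pieces, "")
except TypeError as exc:
    message = str(exc)
    print(message)

# With the default start of 0 the failure comes from int + str instead.
try:
    sum(pieces)
except TypeError as exc:
    default_start_message = str(exc)
    print(default_start_message)

# The spelling the error message points to.
print("".join(pieces))
```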

And I propose the following:

def sum(iterable, start=0):
    if isinstance(start, str) and type(start).__add__ is str.__add__:
        return start + "".join(iterable)
    for item in iterable:
        start = start + item
    return start

So, if a str subclass with a custom __add__ were used as start, it would just repeatedly add.

The case you describe, where start is a str and some or all of the items in the iterable are str subclasses with a custom __add__ method, will still work as expected: for each addition str.__add__ will be used, and thus the result of the sequence of additions is exactly the same as join!

I hope this clarifies my intentions and the proposed change.

Are you still on -1?

ericvsmith · January 27, 2020, 1:51pm

Yes. This code is either going to call join when it shouldn’t, or be quadratic.

I’m sorry, but I just don’t see any practical benefit here.
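
To make "quadratic" concrete, here is a rough cost model rather than a measurement. It assumes that every + builds a brand-new string by copying both operands (an assumption: CPython can sometimes avoid the copy), and the function names are hypothetical:

```python
def chars_copied_by_repeated_add(pieces):
    # Model: after each addition a new string holding everything
    # accumulated so far has to be written out.
    total_copied = 0
    current_length = 0
    for piece in pieces:
        current_length += len(piece)
        total_copied += current_length
    return total_copied


def chars_copied_by_join(pieces):
    # Model: join sizes the result once and copies each piece once.
    return sum(len(piece) for piece in pieces)


pieces = ["x"] * 1000
print(chars_copied_by_repeated_add(pieces))  # 500500, i.e. n * (n + 1) / 2
print(chars_copied_by_join(pieces))          # 1000
```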

ruud · January 27, 2020, 1:53pm

@pf_moore
You don’t seem to get my point.
If my solution were accepted there would be no need to prevent people from doing something “for their own good”, as there is just no performance issue! Calling sum with a string as the start would simply have equivalent performance to join.
I can’t see any reason why we should prevent people from using what seems a logical way of concatenating a number of strings.
The edge case where start is a subclassed str is also properly handled, IMHO.

pf_moore · January 27, 2020, 2:37pm

You don’t seem to be getting my point either. It’s not about whether it can be made safe, it’s about being explicit about what the function is doing. “sum” means “repeated addition”. It does not mean “join with an empty separator”.

I’ve tried to construct examples of how your proposed function would work with string subclasses, and I’m forever getting confused as to whether I should expect __add__ or join to be called. That is not something I’d want to work with, and definitely not something I’d want as a builtin.

I guess it doesn’t matter much. I’m -1 on this, and if it’s to go anywhere, you need at least one core dev to support the idea. That won’t be me, so I’ll let you continue looking for someone to support it. If you do get someone, I’ll pick this up when the PEP is written (when I’ll argue for it to be rejected, as I imagine you’d expect).

brandtbucher · January 27, 2020, 4:26pm

sum is a very simple tool that performs a very simple task on a wide range of types. I’m quoting Guido here:

I ended up hating reduce() because it was almost exclusively used (a) to implement sum() , or (b) to write unreadable code. So we added built-in sum() at the same time we demoted reduce()…

It’s simple to think of sum(seq, start) as functools.reduce(operator.add, seq, start) because that is precisely what it was created to replace.
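
A quick sketch of that mental model, comparing the built-in with the reduce spelling on types the built-in accepts, and showing what the same model gives for strings:

```python
import functools
import operator

numbers = [1, 2, 3]
print(sum(numbers, 10))                             # 16
print(functools.reduce(operator.add, numbers, 10))  # 16

nested = [[1], [2, 3], [4]]
print(sum(nested, []))                              # [1, 2, 3, 4]
print(functools.reduce(operator.add, nested, []))   # [1, 2, 3, 4]

# The reduce spelling has no special case for str, so it simply
# performs repeated addition.
words = ["ham", "eggs"]
reduced_words = functools.reduce(operator.add, words, "")
print(reduced_words)                                # hameggs
```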

If sum were to adopt special str.join semantics for “strings without an overridden __add__”, then that opens the function up to further and further special casing for other types:

Why give an “incorrect” result for the summation of a sequence of floats? We should make sum behave like math.fsum in this case! What if we want to skip NaNs? What about the silently inefficient use of sum to join lists/tuples/Counters/whatever? We should special-case these!

Changing the implementation for str like this, I feel, is an incorrect delegation of responsibility. The implementation of summation should be part of the object, or a helper function, not part of a general-purpose utility. This is better for usability, readability, and maintainability, in the long run.
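
The float case mentioned above can be illustrated with a short sketch. A plain loop is used for the repeated addition, since the built-in's own result for floats may differ between Python versions:

```python
import math

values = [0.1] * 10

# Plain left-to-right addition, the "repeated addition" model of sum.
naive_total = 0.0
for value in values:
    naive_total += value

# math.fsum keeps track of the rounding error of the partial sums.
accurate_total = math.fsum(values)

print(naive_total)     # 0.9999999999999999
print(accurate_total)  # 1.0
```

The helper that gives the "better" answer already exists as a separate function, which is the kind of delegation argued for here.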

(I mentioned in the other thread that the current TypeError strikes me as a bit heavy-handed, but I don’t think that it’s a bad thing to have in light of the quick, one-time education it provides the user.)

ruud · January 27, 2020, 4:44pm

I am afraid that I didn’t express myself clearly enough.
My proposal has nothing to do with opening sum to behave differently.
All I want to do is get rid of the exclusion of str as start. That exclusion was introduced for performance reasons, which is a good thing.
With my solution, however, that’s not a valid reason anymore.
Although I do a join operation when start is str, the effect is exactly the same as repeatedly applying the __add__ method (and that only holds for true str types, of course).
So, using join in that case is just an implementation detail to improve performance in that case.
Does this make my reasoning clearer?
I don’t want to open up sum for any other special case, at all.
All I want is to allow strings to be used as start, which seems more in line with Python’s consistency philosophy than not allowing it.

brandtbucher · January 27, 2020, 5:11pm

I think the following example could clear up some of the issues here:

>>> class Spammer(str):
...     def __radd__(self, other):
...         return "SPAM!!!"
...         
>>> "ham" + "ham" + "ham" + Spammer()
'SPAM!!!'
>>> "".join(["ham", "ham", "ham", Spammer()])
'hamhamham'

What would sum(["ham", "ham", "ham", Spammer()], "") return? According to your implementation, it would return "hamhamham". But according to everyone’s mental model of sum, it should return "SPAM!!!". And this should be the case whether Spammer is a str subclass or not.
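
The comparison can be run side by side. In this sketch proposed_sum is a hypothetical stand-in for the implementation proposed earlier in the thread, and functools.reduce with operator.add is assumed as the model of plain repeated addition:

```python
import functools
import operator


class Spammer(str):
    def __radd__(self, other):
        return "SPAM!!!"


def proposed_sum(iterable, start=0):
    # Hypothetical stand-in for the proposed implementation: only the
    # type of start is checked, the items are never inspected.
    if isinstance(start, str) and type(start).__add__ is str.__add__:
        return start + "".join(iterable)
    return functools.reduce(operator.add, iterable, start)


items = ["ham", "ham", "ham", Spammer()]

via_join = proposed_sum(items, "")
via_addition = functools.reduce(operator.add, items, "")

print(via_join)      # hamhamham
print(via_addition)  # SPAM!!!
```

The last addition, "hamhamham" + Spammer(), is where the two diverge: the + operator consults Spammer.__radd__, while join never does.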

You’ve repeatedly insisted that you’re not making sum behave differently, but you are. You’ve also said that you’re not opening it up to further special-casing more performant behavior (lists and tuples, for instance), but you are. This was debated and decided years ago by respected core developers, in the issue you linked to.

When others raise opposing arguments, it seems that rather than respond to them, you just insist that we don’t get what you’re saying. We’ve seen your implementation in pseudocode several times though, so it’s fair to assume that we’re all on the same page regarding your desired behavior!

I am probably done here, as well.

ruud · January 27, 2020, 5:37pm

@brandtbucher
I am very sorry to say that I had missed the possibility that someone could have overridden the __radd__ method. And of course, I can’t detect that.
That (and only that) makes my implementation useless.
Sorry to have bothered you all. Please accept my apologies.

holdenweb · February 3, 2020, 7:35pm

We’ve all been there. Thanks for accepting the conclusion once things were explained. And thanks to all those who explained in their various ways, all of which informed me. I remember Alex Martelli addressing this issue back in 2003, but alas the Internet is too small to retain his arguments. They were apparently lost but may be retrievable from archives, should it matter, since Dr. Dobb's Python-URL! [LWN.net] says:

Alex Martelli explains why the new sum() built-in doesn't do strings.
   <http://groups.google.com/groups?th=36c124ddab97e1a#link5>

Others similarly burdened by years may remember the python-url fondly, as I do.

aroberge · February 3, 2020, 9:00pm

Could it be part of the following discussion? https://grokbase.com/t/python/python-list/036kqh3v28/sum-strings