In reference to the performance issue, it should be pointed out just how
bad the performance of sum() on strings can be. Really, really bad.
To demonstrate this, we need a simple class that can fool the sum
function into allowing strings, and some timing code:
class ForceString:
    # We need to trick sum into adding strings: adding a string
    # to this object just returns the string unchanged.
    def __add__(self, other):
        return other

x = ForceString()
assert x + "a" == "a"
from timeit import Timer
setup = 'from __main__ import strings, x'
joinT = Timer('"".join(strings)', setup=setup)
sumT = Timer('sum(strings, x)', setup=setup)
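As a quick sanity check (my addition here, not part of the timing runs), the trick really does make sum build the concatenated string, because the start value is no longer a str:

```python
class ForceString:
    # Adding a string to this object returns the string unchanged,
    # so sum() never sees a str start value and its check is bypassed.
    def __add__(self, other):
        return other

x = ForceString()

# sum() starts from x; the first + swallows x, the rest are str + str.
assert sum(['ab', 'cd', 'ef'], x) == 'abcdef'
```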
For a small number of strings, sum isn’t too bad, only about 14 times
slower than join, give or take a bit:
# Tested on Python 3.8
> strings = ['abc']*100
> print('Join:', min(joinT.repeat(number=1000, repeat=5)))
Join: 0.01821363903582096
> print('Sum:', min(sumT.repeat(number=1000, repeat=5)))
Sum: 0.23563178814947605
But as the number of strings increases, the cost of sum increases even
faster. Increase the number of strings by a factor of 100, and sum is
1000 times slower:
> strings = ['abc']*10000
> print('Join:', min(joinT.repeat(number=1000, repeat=5)))
Join: 1.5848643388599157
> print('Sum:', min(sumT.repeat(number=1000, repeat=5)))
Sum: 1593.3930510450155
Increase the number of strings by another factor of 10, and sum is
around 8000 times slower:
> strings = ['abc']*100000
> print('Join:', min(joinT.repeat(number=1000, repeat=5)))
Join: 16.620423825457692
> print('Sum:', sumT.repeat(number=1, repeat=1)[0])
Sum: 135.0639129653573
(The raw numbers there need some care in interpretation: the join
version was run 1000 times for a total time of 16 seconds; the sum
version was run once for a time of 135 seconds. A faster computer will
help with the wall clock timings, but not the relative timings.)
Now it’s clear that the performance of sum is not precisely quadratic,
but it’s much worse than linear. To be honest, I don’t understand why
it falls short of quadratic: if sum were truly quadratic and join
linear, the ratio between them would grow in proportion to the number
of strings, so going from 100 strings to 100,000 should turn the
initial 14× gap into roughly 14,000 times slower than join, not a
measly 8000 times slower.
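The quadratic expectation comes from counting character copies: each + copies the entire intermediate string plus one new piece, so summing n pieces copies O(n²) characters in total, while join allocates once and copies each character exactly once. A back-of-envelope counter (my own sketch, not actual measurements):

```python
def chars_copied_by_sum(n, piece_len):
    # Each addition copies the whole accumulated string plus one new piece.
    total = 0
    acc_len = 0
    for _ in range(n):
        acc_len += piece_len
        total += acc_len  # the i-th + copies i pieces' worth of characters
    return total

def chars_copied_by_join(n, piece_len):
    # join copies every character exactly once into the final string.
    return n * piece_len

# 100 pieces of 'abc': sum copies about 50x more characters than join,
# and that factor itself grows linearly with the number of pieces.
assert chars_copied_by_sum(100, 3) == 15150   # 3 * (1 + 2 + ... + 100)
assert chars_copied_by_join(100, 3) == 300
```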
Note also that sum will perform just as poorly, or worse, when summing
anything where + means concatenation, such as lists or tuples.
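If you do need to flatten a sequence of lists or tuples, the linear-time idiom (my suggestion here, not something from the original discussion) is itertools.chain:

```python
from itertools import chain

lists = [[1, 2], [3], [4, 5]]

# Quadratic: every + copies the entire intermediate list.
flat_slow = sum(lists, [])

# Linear: chain walks each element exactly once.
flat_fast = list(chain.from_iterable(lists))

assert flat_slow == flat_fast == [1, 2, 3, 4, 5]
```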
The conclusion we drew from this many years ago was to discourage people
from using sum() for concatenation. In practice, summing strings is the
trap: nobody is likely to accumulate a list of a billion tuples and then
try to flatten them into a single tuple with sum, but people certainly
will try to concatenate a billion strings.
Hence sum() intentionally prevents the user from summing strings, but
doesn’t bother trying to prevent summing tuples, lists, etc.
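The asymmetry is easy to see interactively (CPython behaviour; the exact wording of the error message may vary between versions):

```python
# Tuples and lists concatenate without complaint (slowly)...
assert sum([(1,), (2, 3)], ()) == (1, 2, 3)
assert sum([[1], [2, 3]], []) == [1, 2, 3]

# ...but a str start value is refused outright with a TypeError.
try:
    sum(['a', 'b'], '')
except TypeError:
    pass
else:
    raise AssertionError("expected sum() to reject a str start value")
```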
In a sense, this was a compromise between those who wanted the right to
shoot themselves in the foot with really slow repeated concatenation,
and those who wanted to protect the coder from accidentally writing
really slow code through ignorance. (“Performance was fine in testing,
but in production, it would sometimes drop to a crawl.”)