String concatenation with a for loop

Hello everyone,

When I run this program with Python 3.12.6 on Debian trixie:

# Part 1
def test():
    s = 'a'*1_000_000
    t = ''
    for c in s:
        print(len(t), id(t))
        t += c

test()

##############################################
# Part 2

s = 'a'*1_000_000
t = ''
for c in s:
    print(len(t), id(t))
    t += c

Part 1 is much faster than Part 2. It seems that in Part 2, Python copies the string at each step.
What is the reason for this behavior?

Thanks!

My guess is that it's the cost of accessing local variables in Part 1 compared with the cost of accessing global variables in Part 2.

In both cases the string manipulation in t += c is the same.
Note that each time you execute t += c, Python actually does t = t + c, which creates a new string to replace t on every pass around the loop.
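
For instance (an illustrative aside; object ids are implementation details), holding a second reference forces a genuinely new string to be built:

t = 'abc'
keep = t                 # a second reference rules out in-place reuse
old_id = id(t)
t += 'd'
print(old_id == id(t))   # typically False: t now names a new object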

CPython actually has a specific optimization for string concatenation: it tries to reuse the original object and avoid copying. I suspect this doesn't always apply, but even if it only works sometimes it can make a huge difference.
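
A small illustration of that reuse (hedged: object ids and in-place resizing are CPython implementation details):

def local_concat():
    t = ''
    ids = set()
    for c in 'a' * 1000:
        t += c
        ids.add(id(t))
    # When CPython can resize the string in place, far fewer distinct
    # ids appear than the 1000 you would get from copying every time.
    print('distinct ids:', len(ids))

local_concat()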

This can only work if the reference count of the object is exactly 1. I suspect that, given how global variable accesses behave compared to function locals, this optimization is also prevented, in addition to what @barry-scott already mentioned about global variables being slower to access than function locals.
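
One way to see the access-cost difference (an illustrative aside, not from the original post) is to compare the bytecode with dis:

import dis

def f():
    t = ''
    for c in 'abc':
        t += c

dis.dis(f)  # function body: LOAD_FAST / STORE_FAST for t and c
# Module-level code goes through dict-based name lookups instead:
dis.dis(compile("t += c", "<module>", "exec"))  # LOAD_NAME / STORE_NAME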

Please don't just say "much faster"; say how much faster. And don't print inside the loop when timing.
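
For example, a hedged sketch that times both variants without printing in the loop (using time.perf_counter; the size is illustrative):

import time

n = 100_000

def part1():
    t = ''
    for c in 'a' * n:
        t += c

start = time.perf_counter()
part1()
print('local :', time.perf_counter() - start)

t = ''
start = time.perf_counter()
for c in 'a' * n:
    t += c
print('global:', time.perf_counter() - start)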

Explanation:


It is not related to strings, but rather to the speed of accessing a local variable versus a global variable.

import time


def bench():
    a = 0
    for i in range(1000000):
        a += i

    return a


start = time.time()
result = bench()

print(time.time() - start)
print(result)

start = time.time()
a = 0
for i in range(1000000):
    a += i

print(time.time() - start)
print(result)
Output:

0.06985592842102051
499999500000
0.1436767578125
499999500000

@elis.byberi Did you have a look at the GitHub issue with the explanation? And why did you measure ints instead of strings? That doesn’t show the huge difference.

0.03 seconds for Part 1 versus 15 seconds for Part 2.
The prints highlight the fact that the string t is copied at each step in Part 2, while this is not the case in Part 1.

The link given by Stephan answers my question, thanks!


String: global vs local

import time


def bench_local():
    s = 'a' * 100_000
    t = ''
    for c in s:
        t += c

    return len(t)


start = time.time()
result = bench_local()

print(time.time() - start)
print(result)

s = 'a' * 100_000
t = ''


def bench_global():
    global s
    global t
    for c in s:
        t += c

    return len(t)


start = time.time()
result = bench_global()

print(time.time() - start)
print(result)

Python 3.10.12:
0.009384393692016602
100000
0.36257052421569824
100000

Python 3.12.7:
0.009612798690795898
100000
0.3437345027923584
100000

Local: string vs integer

import time


def bench_string():
    s = 'a' * 1000000
    t = ''
    for c in s:
        t += c

    return len(t)


start = time.time()
result = bench_string()

print(time.time() - start)
print(result)


def bench_integer():
    a = 0
    for i in range(1000000):
        a += i

    return a


start = time.time()
result = bench_integer()

print(time.time() - start)
print(result)

Python 3.10.12:
0.08484053611755371
1000000
0.06541633605957031
499999500000

Python 3.12.7:
0.06548666954040527
1000000
0.05298948287963867
499999500000

Ah right, for that it’s good. Just not for timing.

That doesn’t seem to cause a significant slowdown. In Python 3.10, there are no copies, but the slowdown remains the same:

# Part 1
def test():
    s = 'a' * 1_000_000
    t = 'b' * 1_000_000
    for c in s[:10]:
        print(len(t), id(t))
        t += c


test()

##############################################
# Part 2

s = 'a' * 1_000_000
t = 'b' * 1_000_000
for c in s[:10]:
    print(len(t), id(t))
    t += c

Python 3.10.12:
1000000 126645204770832
1000001 126645204770832
1000002 126645204770832
1000003 126645204770832
1000004 126645204770832
1000005 126645204770832
1000006 126645204770832
1000007 126645204770832
1000008 126645204770832
1000009 126645204770832
1000000 98116411510800
1000001 98116411510800
1000002 98116411510800
1000003 98116411510800
1000004 98116411510800
1000005 98116411510800
1000006 98116411510800
1000007 98116411510800
1000008 98116411510800
1000009 98116411510800

Python 3.12.7:
1000000 129595043692560
1000001 129595042689040
1000002 129595042689040
1000003 129595042689040
1000004 129595042689040
1000005 129595042689040
1000006 129595042689040
1000007 129595042689040
1000008 129595042689040
1000009 129595042689040
1000000 37507376
1000001 38507440
1000002 37507376
1000003 38507440
1000004 37507376
1000005 38507440
1000006 37507376
1000007 38507440
1000008 37507376
1000009 38507440

@elis.byberi In your benchmark, you used a function with global. In your id logger, you didn’t.

I don’t quite understand what you mean. Could you elaborate?

This:

def bench_global():
    global s
    global t
    for c in s:
        t += c

Vs this:

s = 'a' * 1_000_000
t = 'b' * 1_000_000
for c in s[:10]:
    print(len(t), id(t))
    t += c

Try the latter with global t in a function like in the benchmark. What output do you get?
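
Concretely, a minimal sketch of that variant (the same id logger, but wrapped in a function that declares t global):

s = 'a' * 1_000_000
t = 'b' * 1_000_000

def log_ids():
    global t          # t is rebound in the loop, so it must be declared global
    for c in s[:10]:
        print(len(t), id(t))
        t += c

log_ids()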

It shows a similar output for both Python 3.10 and Python 3.12, but does not affect timing.

The reason is implementation-specific, and of primarily academic interest. Python provides no guarantee that something like

for c in s:
    print(len(t), id(t))
    t += c

won’t take quadratic time. It does guarantee that

parts = []
for c in s:
    parts.append(c)
t = ''.join(parts)

will take linear time, so avoid repeated concatenation.
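
A quick sketch contrasting the two patterns (hedged; the size and names are illustrative):

import time

s = 'a' * 1_000_000

def concat(s):
    t = ''
    for c in s:       # potentially quadratic, depending on the implementation
        t += c
    return t

def join(s):
    return ''.join(s)   # guaranteed linear

for f in (concat, join):
    start = time.perf_counter()
    f(s)
    print(f.__name__, time.perf_counter() - start)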

Locals versus nonlocals:

import time
s = 'a'*1_000_000

def testo():
    t = ''
    def testi():
        start = time.time()
        nonlocal t
        for c in s:
            t += c
        print(time.time() - start)
    testi()

testo()

I think :melting_face: