Optimise string for single-element slice? (Answer: probably not)

Hi,

I recently came across some code that read my_string[:1], which seemed like an odd idiom for my_string[0]. I thought I’d measure how much slower slicing is compared to single-element access, and these are my local results:

$> python3.13 -m timeit '"this is a string"[:1]'
10000000 loops, best of 5: 27.3 nsec per loop
$> python3.13 -m timeit '"this is a string"[0]'
50000000 loops, best of 5: 7.73 nsec per loop

The question is: do people think it would be worth adding such an optimisation to the unicodeobject.c module? There is already some special-casing for slices that cover the full string, so there is precedent.

The first generates a string, which afaik requires creating a new string object, while the second one just requires creating an integer, right? I think a fairer comparison would be str("string"[0]), but I might be wrong.

They’re both strings:

>>> s = 'hello'
>>> type(s[0])
<class 'str'>
>>> type(s[:1])
<class 'str'>
1 Like

I think this is a case where the proper fix is to make sure people are aware that they shouldn’t write s[:1] or s[-1:] and should use s[0] or s[-1] instead. It’s both clearer to read and more performant.

In your second example the result is calculated at compile time. It is equivalent to

$> python3.13 -m timeit '"t"'

or even

$> python3.13 -m timeit 'pass'
3 Likes

I’ve not written enough Python recently and forgot that indexing a str yields another str. Thanks for reminding me!

1 Like

Here’s the disassembly for whoever’s curious:

>>> import dis
>>> dis.dis('"foo"[0]')
  1           0 LOAD_CONST               0 ('f')
              2 RETURN_VALUE
>>> dis.dis('"foo"[:1]')
  1           0 LOAD_CONST               0 ('foo')
              2 LOAD_CONST               1 (None)
              4 LOAD_CONST               2 (1)
              6 BUILD_SLICE              2
              8 BINARY_SUBSCR
             10 RETURN_VALUE
1 Like

They probably did it to handle empty strings.
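To spell out the empty-string behaviour the two idioms differ on: slicing past the end quietly yields an empty string, whereas indexing raises. A quick illustration:

```python
s = ""

# Slicing never raises: an out-of-range slice just yields ''.
print(repr(s[:1]))

# Indexing an empty string raises IndexError.
try:
    s[0]
except IndexError as exc:
    print("IndexError:", exc)
```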

15 Likes

There is still a difference when using a str variable:

In [1]: s = "this is a string"

In [2]: %timeit s[1]
9.66 ns ± 0.212 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

In [3]: %timeit s[1:2]
22.9 ns ± 1.63 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

In [4]: import dis

In [5]: dis.dis('s[1]')
  0           RESUME                   0

  1           LOAD_NAME                0 (s)
              LOAD_CONST               0 (1)
              BINARY_SUBSCR
              RETURN_VALUE

In [6]: dis.dis('s[1:2]')
  0           RESUME                   0

  1           LOAD_NAME                0 (s)
              LOAD_CONST               0 (1)
              LOAD_CONST               1 (2)
              BINARY_SLICE
              RETURN_VALUE

The code that distinguishes these is here:

There are two special cases that avoid calling PyUnicode_Substring: s[:] and s[:0]

In [9]: %timeit s[:]
18.1 ns ± 2.32 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

In [10]: %timeit s[:0]
20.8 ns ± 1.26 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

So you could maybe shave off 2-4ns by checking for slicelength == 1, but you wouldn’t get down to s[i] speeds. Doesn’t seem worth it to me.
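Those two fast paths can also be observed from pure Python. In CPython (this is an implementation detail, not a language guarantee), a full slice of an exact str returns the original object, and a zero-length slice returns the shared empty-string object, so neither builds a new string:

```python
s = "this is a string"
empty = ""

# Full slice of an exact str returns the same object, no copy made.
print(s[:] is s)        # True in CPython

# Zero-length slices come back as the shared empty-string object.
print(s[:0] is empty)   # True in CPython
```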

3 Likes

Yeah, I’ve used x[:1] == c to handle empty strings. It’s faster and shorter than x and x[0] == c or x.startswith(c). It’s a useful idiom when parsing data, and making it even a tiny bit faster would be nice.
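For example, in a parser the x[:1] == c idiom folds the empty-string guard and the first-character test into one expression. A minimal sketch (is_comment is an illustrative helper, not code from the thread):

```python
def is_comment(line: str) -> bool:
    # Safe for empty strings: ''[:1] is '', which never equals '#'.
    return line[:1] == "#"

print(is_comment("# hello"))  # True
print(is_comment(""))         # False
print(is_comment("x = 1"))    # False
```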

3 Likes

More times (though with older Python):

  0.7 ± 0.1 ns  'abc'[0]
  0.7 ± 0.1 ns  pass
 16.5 ± 0.0 ns  s[0]
 22.1 ± 0.1 ns  s[sl]
 22.5 ± 0.3 ns  s and s[0]
 23.4 ± 0.3 ns  s[0] if s else ''
 29.3 ± 0.1 ns  s[:0]
 29.4 ± 0.1 ns  s[:]
 32.6 ± 0.1 ns  s[:1]

Python: 3.11.9 (main, Apr  2 2024, 08:25:04) [GCC 13.3.0]

With s = 'abc' and sl = slice(1) (to save the time for building the slice object every time).

Code
from timeit import timeit
from statistics import mean, stdev
import sys
import random

setup = '''
s = 'abc'
sl = slice(1)
'''

funcs = ['s[0]', 's[:1]', 's[sl]', "'abc'[0]", 'pass', "s[0] if s else ''", 's and s[0]', 's[:]', 's[:0]']

times = {f: [] for f in funcs}
def stats(f):
    # Mean and stdev of the five fastest rounds, in nanoseconds.
    ts = [t * 1e9 for t in sorted(times[f])[:5]]
    return f'{mean(ts):5.1f} ± {stdev(ts):3.1f} ns '
for _ in range(25):
    # Shuffle so no expression consistently benefits from warm caches.
    random.shuffle(funcs)
    for f in funcs:
        # Repeat each statement 10x per pass to dilute the loop overhead.
        t = timeit(f'{f};'*10, setup, number=10**5) / 1e6
        times[f].append(t)
for f in sorted(funcs, key=stats):
    print(stats(f), f)

print('\nPython:', sys.version)

I do wonder if we could make slices constants. By the looks of it, that would speed this up by about 10 nanoseconds.

Perhaps some set of them could be kept as singletons, similar to the small-integer cache, but it seems pretty difficult to identify the most useful set?

I suppose an easy first set is [1:], [:1], [:-1], [::-1], and similar. But saving 10ns in a few cases is a very small gain.
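In the meantime, user code can hoist the slice object out of a hot path itself, which is exactly what the sl = slice(1) measurement above does. A sketch with made-up names (note the 3.11 numbers above show s[sl] beating s[:1], while the 3.14 numbers show the gap narrowing, so measure on your own version):

```python
# Build the slice object once instead of on every subscript.
FIRST = slice(0, 1)

def first_chars(strings):
    # Equivalent to [s[:1] for s in strings], but reuses one slice object.
    return [s[FIRST] for s in strings]

print(first_chars(["abc", "", "xyz"]))  # ['a', '', 'x']
```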

Seems done (for 3.14?):

2 Likes

Lol, I commented on that. :slight_smile: Would someone like to benchmark this on a build of 3.14?

1 Like

Tried my script on a fresh CPython GitHub codespace after this:

dnf install compiler-rt
./configure --enable-optimizations
make -s -j

Results:

  0.9 ± 0.0 ns  pass
  0.9 ± 0.0 ns  'abc'[0]
 10.6 ± 0.0 ns  s[0]
 16.2 ± 0.0 ns  s[0] if s else ''
 18.1 ± 0.0 ns  s and s[0]
 24.8 ± 0.0 ns  s[:]
 30.1 ± 0.2 ns  s[:0]
 37.0 ± 0.2 ns  s[sl]
 38.6 ± 0.1 ns  s[:1]

Python: 3.14.0a1+ (heads/main:a83472f49b9, Nov 12 2024, 18:57:03) [Clang 18.1.8 (Fedora 18.1.8-1.fc40)]
1 Like

Here are the results for s[:1] on different Python versions:

53.4 ± 0.1 ns  Python: 3.9.20
51.9 ± 0.2 ns  Python: 3.10.15
50.3 ± 0.0 ns  Python: 3.11.10
44.5 ± 0.0 ns  Python: 3.12.7
43.2 ± 0.0 ns  Python: 3.13.0
29.5 ± 0.1 ns  Python: 3.14.0a1+

That’s a respectable 1.81x faster than Python 3.9.

4 Likes

Firstly, thanks everyone for chiming in, it was an interesting discussion!

That’s indeed the code I referred to when I posted originally, saying that some special cases are already handled. Your analysis seems the most relevant to me, as it predicts how much we could gain if this optimisation were implemented, and like you say, it’s not much really.

This is what I tried now after considering everyone’s feedback:

In [1]: import sys
In [2]: s = sys.argv[0]
In [3]: single_element_slice = slice(0, 1)
In [4]: null_slice = slice(1,1)
In [5]: null_slice2 = slice(0, 0)
In [6]: full_slice = slice(0, len(s))
In [7]: %timeit s[0]
15.9 ns ± 0.0599 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)
In [8]: %timeit s[0]
16 ns ± 0.0257 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)
In [9]: %timeit s[1]
16.2 ns ± 0.277 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)
In [10]: %timeit s[null_slice]
22 ns ± 0.0195 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [11]: %timeit s[null_slice2]
22 ns ± 0.0241 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [12]: %timeit s[single_element_slice]
22.5 ns ± 0.0165 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [13]: %timeit s[full_slice]
26.6 ns ± 0.0355 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

The best-case scenario for slice-based access is the zero-length slices, so those indicate the overhead of parsing the slice object, etc., before constructing any new str object. So a one-element slicing operation that (locally) takes only half a nanosecond more on average than that isn’t too bad, actually :slight_smile:

Indeed, I hadn’t realised my initial analysis was flawed by using a constant. I also hadn’t considered the overhead of re-creating the slice object itself. When using a variable, things do get worse for the direct-indexing case, and reusing the slice also seems to make a difference, albeit a less noticeable one, so maybe it’s just noise at that point.

Indeed, the code was part of a parser, but by the time that code was hit the string was known to be non-empty, hence direct indexing into the first element made more sense to me. I figured direct indexing was faster, and while it is, the gain isn’t much.

2 Likes