Is it possible to inspect a dict object to see if it has lost its string fastpath?

I’m currently trying to pin down a really weird performance regression between two commits in my code: a function that just does a basic dict read and insert is about 7x slower (0.122s vs 0.893s, both measured via cProfile) over the exact same set of inputs, on the exact same dict, with nothing changed that could really affect it.

I’m still trying to pin down the exact change that caused it, but I was initially curious because I heard that Python dicts have a built-in string fast path that is jettisoned in favour of a generic implementation whenever a non-str key access happens. (Sorry if that is no longer the case; the info I found on it was quite old and I wasn’t able to find anything more up to date.) The dict being used here is string-keyed, so I was curious whether some weird interaction is causing it to lose the string fast path and consequently slow down. Of course I have already inserted runtime checks for non-str keys at every place the dict is read or written, and they didn’t raise any exceptions, but I’m running out of ideas.

So yeah, as the title says: is it possible to inspect a dict object and see whether this substitution of the fast str method for the slow generic method has happened? How?

Thanks.

I’m not clear what you are saying is a regression.

Are you saying that your code runs slower if you use cProfile?

cProfile makes Python code run slowly; that’s expected.

Sorry, I guess the post was unclear; I will edit it. The regression is between two versions of the code that have a fair quantity of changes in between (because I don’t profile the code with every edit), but none that should have any impact on this function, which is unedited. It’s a very simple function that just takes in a key/value pair and inserts it into a dict, with some small logic if a value for that key is already present. I’m still trying to narrow down the exact change that triggers it, and I haven’t been able to make a more minimal version.

By “via cProfile” I just meant that both numbers are the measured total runtimes associated with that function from running both versions of the code under cProfile.

Edit: in either case, I’m not asking for help with the performance regression directly (although if someone were able to figure it out I’d obviously be grateful); it’s just context for my main question, which is essentially the title of the post.

You think that you may have added a non-string as a key and that caused the slowdown.

Are you able to add an assert to check that all the keys you use on the dict are strings?
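Something like this, say (a sketch; `checked_insert` is a hypothetical helper, and `type(key) is str` is deliberately stricter than `isinstance`, since a str subclass would presumably also miss an exact-str fast path):

```python
# Hypothetical helper: route every write through it so a stray
# non-str key fails loudly at the insertion site.
def checked_insert(d, key, value):
    # `type(key) is str` is stricter than isinstance(key, str): a str
    # subclass would presumably also fall off an exact-str fast path.
    assert type(key) is str, f"non-str key {key!r} ({type(key).__name__})"
    d[key] = value

cache = {}
checked_insert(cache, "a", 1)      # fine
# checked_insert(cache, 2, "b")    # would raise AssertionError
```

The same check would go in front of every read as well, since a lookup with a non-str key could in principle matter just as much as an insert.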

You might try building CPython from source with debug symbols and then running your code under a sampling profiler. That will allow you to directly see where the interpreter is spending time.

I like using samply: GitHub - mstange/samply: Command-line sampling profiler for macOS, Linux, and Windows

On Linux and Python 3.12 and newer you can use the new perf instrumentation to get Python frames inside native profiles: Python support for the Linux perf profiler — Python 3.13.5 documentation

I wrote some docs about how to use samply with free-threaded Python here: Multithreaded Profiling with samply - Python Free-Threading Guide
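For reference, the 3.12+ perf support is enabled with the `-X perf` interpreter option (a sketch; `x.py` stands in for your entry script, and this assumes a Linux machine with perf installed):

```shell
# Record at a high sample rate with call graphs; -X perf makes CPython
# emit the map entries perf needs to symbolize Python frames.
perf record -F 9999 -g -- python -X perf x.py
perf report --stdio
```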

I did add a runtime check using isinstance(x, str) that didn’t seem to find anything

I did try py-spy as a sampling profiler earlier, my problem with sampling profilers for this codebase is that the sample rate would have to be extremely high to get a good picture. The function in question is called 284020 times in a total program runtime of about 25s, but more realistically those calls are all clustered within maybe 10s of runtime. At the default samply sampling rate of 1000/s vs function call rate of 28402/s I’m not sure I’d get much useful from that.

I’ll have a look at Linux perf and see if it gives me anything interesting, thanks for the tip. I am running on WSL, so hopefully that’s compatible; I’ll have to check whether my Python was built with the right options, however.

Thanks

I think that the answer to your main question is basically just “no”, or at the very least “not easily”. In other words, it is probably not a productive approach to investigating the regression.

No one would be able to figure out the regression based on the information provided, but I can tell you what I would do in this situation, which is to use git bisect to narrow down the changes as much as possible. I would definitely do that before spending any significant time looking into CPython internals.
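Concretely, that looks something like this (a self-contained toy demo so it runs anywhere; in your real repo you’d point `git bisect run` at a benchmark script that exits non-zero when the function is slow):

```shell
# Toy repo standing in for the real one; the regression is modelled by a
# marker file so `bisect run` has something deterministic to test.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "fast 1"
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "fast 2"
touch regression && git add regression
git -c user.name=demo -c user.email=demo@example.com commit -q -m "introduces slowdown"
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "later work"

# bad commit first, then the last known-good one; the run command must
# exit non-zero when the regression is present (here: marker file exists)
git bisect start HEAD HEAD~3
git bisect run sh -c '! test -f regression'
first_bad=$(git rev-parse refs/bisect/bad)
git bisect reset
echo "first bad commit: $(git log --format=%s -1 "$first_bad")"
```

The exit-code convention is the whole trick: `git bisect run` treats exit 0 as “good” and 1–124 as “bad”, so a benchmark script that times the function and exits 1 past some threshold automates the entire search.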


Oh yeah, I have a self-hosted GitLab that this is hosted on, and I’m using that to see the full changeset between the two commits at hand. Off work now, so I’m going to try to narrow things down some more.

I’d like to see that, please share the link.

I see the opposite, using a non-string key makes it faster:

d = {'a': 1, 'b': 2}
e = {'a': 1, 2: 'b'}

d["a"]   21.5 ± 0.1 ns 
e["a"]   18.1 ± 0.1 ns 

Python: 3.13.0 (main, Nov  9 2024, 10:04:25) [GCC 14.2.1 20240910]
benchmark script

setup = '''
d = {'a': 1, 'b': 2}
e = {'a': 1, 2: 'b'}
'''
print(setup)

funcs = list('de')

from timeit import timeit
from statistics import mean, stdev
import sys
import random

times = {f: [] for f in funcs}
def stats(f):
    ts = [t * 1e9 for t in sorted(times[f])[:10]]
    return f'{mean(ts):5.1f} ± {stdev(ts):3.1f} ns '
for _ in range(1000):
    random.shuffle(funcs)
    for f in funcs:
        t = timeit(f'{f}["a"];' * 100, setup, number=10**3) / 10**5
        times[f].append(t)
for f in 'de':
    print(f'{f}["a"] ', stats(f))

print('\nPython:', sys.version)

Attempt This Online!

I found it mentioned here: https://x.com/TedPetrou/status/969026757218070528 and here in more detail: lewk.org/python-dictionary-optimizations.

Looking at the current Python dict code (cpython/Objects/dictobject.c at main · python/cpython · GitHub), it doesn’t seem to be quite the same, but there are a lot of references to unicode lookups and unicode-specific functions, so string fast paths of some kind do seem to still be present.


I don’t actually see that performance difference; on my machine they print identical times. Curious as to why that might be: I have a more recent version of Python but an older GCC.

Adding to your script a bit, what I do see is that there still seems to be a fast path for looking up string keys, even if the key is very large, while very large ints do cause noticeably more performance degradation. It just doesn’t seem to require that the entire dict be string-keyed, as it apparently used to.

Slightly edited script

setup = """
d = {'a': 1, 'b': 2}
e = {'a': 1, 2: 'b'}
f = {1: 1, 2: 'b'}
g = {'abcdefghijklmnopqrstuvwxyz1234567890abcdefghijklmnopqrstuvwxyz1234567890': 1, 2:'b'}
h = {12345678901234567890123456789012345678901234567890: 1, 2:'b'}
"""
print(setup)


funcs = {"d": '"a"', "e": '"a"', "f": 1, "g": '"abcdefghijklmnopqrstuvwxyz1234567890abcdefghijklmnopqrstuvwxyz1234567890"', "h": 12345678901234567890123456789012345678901234567890}

import random
import sys
from statistics import mean, stdev
from timeit import timeit

times = {f: [] for f in funcs}


def stats(f):
    ts = [t * 1e9 for t in sorted(times[f])[:10]]
    return f"{mean(ts):5.1f} ± {stdev(ts):3.1f} ns "


for _ in range(1000):
    func_items = list(funcs.items())
    random.shuffle(func_items)
    for f, v in func_items:
        t = timeit(f"{f}[{v}];" * 100, setup, number=10**3) / 10**5
        times[f].append(t)
for f, v in list(funcs.items()):
    print(f"{f}[{v}] ", stats(f))

print("\nPython:", sys.version)

My results:

d = {'a': 1, 'b': 2}
e = {'a': 1, 2: 'b'}
f = {1: 1, 2: 'b'}
g = {'abcdefghijklmnopqrstuvwxyz1234567890abcdefghijklmnopqrstuvwxyz1234567890': 1, 2:'b'}
h = {12345678901234567890123456789012345678901234567890: 1, 2:'b'}

d["a"]    7.1 ± 0.0 ns 
e["a"]    7.1 ± 0.0 ns 
f[1]    8.3 ± 0.1 ns 
g["abcdefghijklmnopqrstuvwxyz1234567890abcdefghijklmnopqrstuvwxyz1234567890"]    7.4 ± 0.1 ns 
h[12345678901234567890123456789012345678901234567890]   11.2 ± 0.1 ns 

Python: 3.13.3 (main, Apr  9 2025, 08:55:03) [GCC 13.3.0]
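My guess as to why the huge int degrades while the huge str doesn’t (an assumption on my part, not something I’ve verified against the source): str objects cache their hash after the first computation, while a big int’s hash has to be recomputed from its digits on every call. A quick check:

```python
import timeit

s = "x" * 10_000    # long str key: hash computed once, then cached on the object
n = 10 ** 10_000    # big int key: hash recomputed from its digits on every call

hash(s)  # warm the str's cached hash

t_str = timeit.timeit("hash(s)", globals={"s": s}, number=100_000)
t_int = timeit.timeit("hash(n)", globals={"n": n}, number=100_000)
print(t_str < t_int)
```

On my machine this prints True by a wide margin, which lines up with g being fast and h being slow above.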

I come with an update much more bizarre than I could have ever anticipated.

The regression does not come from my code in the slightest. Rather, it is an “issue” with cProfile. I had recently written a helper script for profiling this program. Previously I would just run
`python -m cProfile -o test.prof x.py` from the command line. This script instead grabbed the current git commit hash and used cProfile.run, saving the resulting file in a folder named after the commit hash for better recordkeeping.

(Relevant section of said script)

import cProfile
from pathlib import Path

from x import run_main

...

cprofile_path = Path("profiling", "prof_" + commit + ".bin")
cprofile_path.unlink(missing_ok=True)
cProfile.run("run_main()", filename=cprofile_path.as_posix())

The end of x.py is

if __name__ == "__main__":
    out = run_main()

so this should really be identical to calling `python -m cProfile x.py`

The difference in reported runtime is caused by whether you run cProfile via the module or via the command line. I would not believe it myself had I not just tested it 6 times, 3 for each, alternating each time. The measured cumtimes were 0.838, 0.122, 0.895, 0.127, 0.893, 0.125, so [0.122, 0.127, 0.125] for running via the command line and [0.838, 0.895, 0.893] for running via the module.

The part that confuses me the most is that all other measured function times are extremely close between the two (running via command line vs running via module). For example, to pick another large function that does not contain the regressing function in question, the 6 measured times were [5.23, 5.19, 5.18, 5.38, 5.25, 5.36], all very close and within normal variation, which is why I did not pick up on this being a cProfile problem at first.

Additionally, all files agree on the ncalls of all functions that are part of my code, with some very slight variations in ncalls for built-in functions (I suspect this comes from slight randomness in set ordering due to hash randomization). Regardless, the issue with the regressing function is clearly not a matter of randomness given the demonstrated consistency.
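One more experiment I may try (a sketch with a stand-in `run_main`, since I can’t share the real one): driving a `cProfile.Profile` object directly instead of going through `cProfile.run`, which would at least rule out the `exec`-based string evaluation inside `run` as the variable:

```python
import cProfile
import pstats

def run_main():
    # Stand-in for the real entry point imported from x
    return sum(i * i for i in range(1000))

# Profile objects work as context managers since Python 3.8
with cProfile.Profile() as pr:
    out = run_main()

stats = pstats.Stats(pr)
stats.sort_stats("cumulative")
stats.print_stats(5)
```

If the times from this match the command-line runs rather than the `cProfile.run` ones, that narrows the discrepancy down considerably.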

I guess the thread kind of ends there. I could try to open a discussion or bug report about this against cProfile, but I have no clue how I would even start to create a minimal reproducible example, and I’m simply not willing to share the full codebase.