Shrink USABLE_FRACTION for performance (ideas for saving space elsewhere to come later)

Anyone here who follows my posts closely may have noticed that I’ve been studying the Python dict implementation quite intensely over the last few days.[1] I have essentially two ideas for improvements that could save a fair bit of space (varying amounts from 10-50%, with 25% typical for smallish dicts) above and beyond recent improvements, with what I believe would be minimal impact on performance. But I’ll present those separately. For the moment, I want to talk about trading off some space for performance, since so much space has already been saved.

Some background: the USABLE_FRACTION macro in the CPython code governs the amount of space that can conceptually be filled in the hash table powering a Python dict (whose size is always a power of two). Essentially it multiplies the table size by a maximum load factor, currently two-thirds. Thus, when the first key is added to an empty dict and it’s given the smallest allowed hash table (8 slots), that dict can hold up to 5 items before it will be rehashed and the table resized.
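For reference, the macro itself is tiny; in recent versions of Objects/dictobject.c it looks roughly like this (the exact form has shifted a little between versions):

#include <stdio.h>

/* Approximately the definitions in Objects/dictobject.c (recent versions):
 * the usable portion of an n-slot table is 2/3 of n, in integer arithmetic. */
#define PyDict_MINSIZE 8
#define USABLE_FRACTION(n) (((n) << 1)/3)

int main(void)
{
    /* The smallest table (8 slots) can hold 5 items before a resize. */
    printf("%d\n", USABLE_FRACTION(PyDict_MINSIZE));  /* prints 5 */
    return 0;
}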

Since the table size doubles whenever it is resized, and a resize is triggered at a density of about 2/3, it stands to reason that a freshly resized table has a density of about 1/3. Assuming, as an approximation, that dicts of each size are equally common across any given small range of sizes, the average density is about 1/2.[2]

Before the Python 3.6 optimization, this meant that a dict with a table size of, say, 128, would use 384 words for that table (a key pointer, value pointer and hash value for each table entry), plus whatever header, and on average represent 64 key-value pairs. And that was the tradeoff for a while. Some other languages offered hash tables that defaulted to a higher maximum load factor (from what I can tell, Java used 0.75), and perhaps offered some customization.

But now, with the compact layout, the empty slots in the table can mostly be collapsed to as little as a single byte each - the table proper is a sparse index array pointing into a dense array of entries - and 64-bit architectures are common. The same dict now uses 255 + 16 = 271 words on a 64-bit machine: 85 usable three-word entries, plus 128 one-byte indices packed into 16 words. And even though the table’s average load is nominally ~1/2, the dense entry storage (which is sized for a 2/3 load) averages ~3/4 full.

Not to mention the other optimizations available: setting aside shared-key dicts for a moment, dicts that have only string keys get a special optimization that drops the hashes out of the table (the hash is instead taken from the string object itself), so the entries shrink to two words each and such a dict uses only 170 + 16 = 186 words. That’s less than half the original amount.
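To make those numbers concrete, here’s a quick back-of-envelope calculation (ordinary C, not CPython code; the per-entry sizes are just the layouts described above):

#include <stdio.h>

int main(void)
{
    const int slots = 128;               /* table size (a power of two) */
    const int usable = (slots * 2) / 3;  /* 2/3 load factor -> 85 entries */
    const int word = 8;                  /* bytes per word on a 64-bit build */

    /* Pre-3.6 layout: every slot stores a hash, key and value (3 words). */
    int old_words = slots * 3;                      /* 384 */

    /* Compact layout: dense entry array (3 words each) plus a sparse
     * index array of one byte per slot. */
    int index_words = slots / word;                 /* 16 */
    int compact_words = usable * 3 + index_words;   /* 255 + 16 = 271 */

    /* String-keys-only layout described above: entries drop the stored
     * hash, so 2 words each. */
    int unicode_words = usable * 2 + index_words;   /* 170 + 16 = 186 */

    printf("old: %d, compact: %d, string keys: %d words\n",
           old_words, compact_words, unicode_words);
    return 0;
}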

So at this point I’d like to propose setting the maximum load factor to half what it currently is - in practical terms, doubling each “table index” size, but leaving “table entry” sizes alone. This causes a hiccup for dicts with 86…170 keys, since they get shifted up a “shoebox size” and need to use 16-bit values for indices (so the “table index” storage space actually quadruples). However, I think this is well worthwhile. The worst-case increase in space used is about 26% ((340+H+128)/(340+H+32), where H is the relatively small number of words of space in the dict headers etc.). The normal cases are more like 6% ((255+H+32)/(255+H+16)).
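As a sanity check on those ratios, here’s the same arithmetic with a concrete header size plugged in; H = 12 words is only a placeholder for the dict object plus keys-object headers, not a measured figure:

#include <stdio.h>

int main(void)
{
    const double H = 12.0;  /* assumed header overhead in words; placeholder only */

    /* Worst case from above: 86..170 string-keyed entries (340 words),
     * with index storage going from 32 to 128 words. */
    double worst = (340 + H + 128) / (340 + H + 32);

    /* Typical case from above: 85 three-word entries (255 words),
     * with index storage going from 16 to 32 words. */
    double typical = (255 + H + 32) / (255 + H + 16);

    printf("worst-case growth:   %.1f%%\n", (worst - 1) * 100);    /* ~25% */
    printf("typical-case growth: %.1f%%\n", (typical - 1) * 100);  /* ~6%  */
    return 0;
}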

Plus, again, I have more possible space optimizations in mind - why not give some of that savings back, if the performance can improve?

So let’s examine that: with random[3] probing into a table at load factor L, finding an empty slot takes about 1/(1 - L) attempts on average, so when the table is 2/3 full that’s 3 attempts. At 1/3 full, it’s only half that: 1.5 attempts. Slightly more advanced math gives an average across a range of loads, assuming a uniform distribution: with the load varying from 1/3 to 2/3, the average number of probes is about 2.08.[4] If we instead lower the maximum load to 1/3, then the load ranges from 1/6 to 1/3 by the earlier argument, and the average number of probes falls to about 1.34 - roughly a 36% improvement.
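Those figures fall out of the idealized 1/(1 - L) approximation, averaged over the load range (for loads uniform on [a, b], the average is ln((1 - a)/(1 - b))/(b - a)). A quick check - this models idealized random probing, not CPython’s actual probe sequence:

#include <stdio.h>
#include <math.h>

/* Expected probes at a fixed load factor, for idealized random probing. */
static double probes_at(double load) { return 1.0 / (1.0 - load); }

/* Average of 1/(1-x) over loads uniformly distributed on [a, b]. */
static double probes_avg(double a, double b)
{
    return log((1.0 - a) / (1.0 - b)) / (b - a);
}

int main(void)
{
    printf("at 2/3 load:    %.2f probes\n", probes_at(2.0/3.0));           /* 3.00 */
    printf("at 1/3 load:    %.2f probes\n", probes_at(1.0/3.0));           /* 1.50 */
    printf("load 1/3..2/3:  %.2f probes\n", probes_avg(1.0/3.0, 2.0/3.0)); /* ~2.08 */
    printf("load 1/6..1/3:  %.2f probes\n", probes_avg(1.0/6.0, 1.0/3.0)); /* ~1.34 */
    return 0;
}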

Of course, the returns are rapidly diminishing[5], and the time spent inserting a key depends on more than just the number of attempts to find a place for it[6]. So I’m not proposing to go any further than that. But this seems to me like an easy win.


  1. In fact, this has been part of an effort to design a combined sequence/mapping data structure for my own programming language. ↩︎

  2. Of course, it’s not possible to have a uniform distribution over an unbounded interval. In practical terms, a dictionary’s maximum size is limited by memory, as well as by the type used to record the hash table size. But more importantly, in practice the distribution is not uniform, but weighted towards smaller dicts. One can imagine that the actual average density, in practice, is slightly lower than the naive calculation gives. ↩︎

  3. Of course, probing does need to be deterministically based on the hash, but it should ideally have the characteristic of a random probe - except slightly better, since it should avoid trying the same slot twice. ↩︎

  4. Again, slightly less if we account for a practical distribution rather than a uniform one. But the same applies regardless of load factor; to a first approximation I will say that it cancels out. ↩︎

  5. Obviously, it takes at least 1 attempt on average. ↩︎

  6. The overall impact would need to be benchmarked, of course. ↩︎


Historically, I tried 1/2, but it didn’t show significant performance improvements. That’s why I kept using 2/3.

Could you confirm that shrinking USABLE_FRACTION really has a performance benefit?


I’m trying to set up a release build in order to test performance properly. But when I run make again after the changes, I get a bunch of errors claiming that source locations for function <name> have changed and the profile data may be out of date. It still happens if I rerun ./configure --enable-optimizations --with-lto first. I thought the makefile would be able to rebuild the profile data automatically. How do I proceed?

Run make clean and then make again.

That doesn’t seem to work either. I get:

Building with support for profile generation:
malloc(): corrupted top size
malloc(): corrupted top size
malloc(): corrupted top size
malloc(): corrupted top size
Aborted (core dumped)
make[2]: *** [Makefile:1329: Python/frozen_modules/importlib._bootstrap_external.h] Error 134
make[2]: *** Waiting for unfinished jobs....
Aborted (core dumped)
make[2]: *** [Makefile:1332: Python/frozen_modules/zipimport.h] Error 134
Aborted (core dumped)
make[2]: *** [Makefile:1321: Python/frozen_modules/getpath.h] Error 134
Aborted (core dumped)
make[2]: *** [Makefile:1326: Python/frozen_modules/importlib._bootstrap.h] Error 134
make[1]: *** [Makefile:803: profile-gen-stamp] Error 2
make: *** [Makefile:815: profile-run-stamp] Error 2

I tried again, starting fresh from a GitHub checkout, but I still hit this error. It’s coming from the build process itself, not from running Python. I can’t understand how a simple tweak like this could cause such a problem. (Maybe my specific GCC version has a problem with LTO?)

Could it be that there is a different place somewhere that calculates a size value without going through this macro? Or maybe you need to update the magic number for pyc files… for some reason.


It seems like there’s something along those lines. The problem is very reproducible (even the same files are mentioned each time), and it doesn’t seem to matter whether I make clean or git clean. (In fact, if I don’t add a line to that one comment, I don’t get any errors about the profile data and I proceed straight to the same build failure.) But getpath is the only one of the problematic frozen modules that seems to have any C code pre-freeze, and I couldn’t see anything suspicious in there - while it does a lot of dictionary building, it goes through the normal PyDict_SetItemString etc. API.

… It did occur to me that there is “a different place somewhere” in the same file: if I halve the result of USABLE_FRACTION, then I also need to double the result of GROWTH_RATE, because that is used directly with calculate_log2_keysize (and I also need a corresponding change to estimate_log2_keysize, because that’s used for pre-sized dictionaries such as those from dict.fromkeys).
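Concretely, the kind of change I’m experimenting with looks roughly like this (the “before” forms are approximately what recent Objects/dictobject.c has; exact definitions vary between versions, and the “after” forms are just my experiment, nothing merged):

/* Before (recent CPython, approximately): resize once the table is 2/3
 * full, picking a new size from roughly 3x the number of used entries.
 *
 *   #define USABLE_FRACTION(n) (((n) << 1)/3)
 *   #define GROWTH_RATE(d)     ((d)->ma_used*3)
 */

/* After (the experiment described above): cap the load at 1/3 and double
 * the resize target, so that calculate_log2_keysize still picks a table
 * with enough usable slots; estimate_log2_keysize (used for pre-sized
 * dicts, e.g. dict.fromkeys) needs a matching adjustment. */
#define USABLE_FRACTION(n) ((n)/3)
#define GROWTH_RATE(d)     ((d)->ma_used*6)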

(Incidentally, all these names seem inaccurate, but I’m not really sure what’s better.)

I was just getting thrown off because I hadn’t fathomed that Python dicts would be used during the build process - as part of creating frozen modules.

… But now I have a new issue, seemingly even further in:

$ make -j -s
Rebuilding with profile guided optimizations:
Segmentation fault (core dumped)
make[1]: *** [Makefile:1413: Python/deepfreeze/deepfreeze.c] Error 139
make: *** [Makefile:833: profile-opt] Error 2

Sigh…