I’m trying to put together a really complete, detailed technical understanding of the Python dict implementation and how it has varied across Python versions - so that I can do a writeup and have all the information in one place.
So far, I have:
- refreshed my general understanding of hash tables (it’s been 20+ years since I studied this in university)
- referred to a Stack Overflow Q&A, @rhettinger's original proof-of-concept and proposal for the new implementation in 3.6, and some issue tracker entries such as GROWTH_RATE prevents dict shrinking · Issue #77386 · python/cpython
- looked through several versions of the relevant source code, along with `dictnotes.txt` (although the latter seems to be out of date and didn’t really clarify much for me)
- written some test code to investigate some interesting scenarios in different Python versions, and put it on GitHub
However, there have been details I haven’t been able to figure out. I’ll explain my aims with the test code and what I think I’ve demonstrated so far, and ask some questions along the way.
First: across many Python versions it has been a constant, as far as I can tell, that the smallest available hash table has 8 slots, and that Python will rehash when an attempt is made to use the 6th of these. This is because of the `USABLE_FRACTION` macro, which seems to have been left alone across versions, and which implies that `(8 << 1) / 3`, i.e. 5, of the table slots may be used.
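To make that arithmetic concrete, here is the threshold as I understand it, translated from the macro into Python (my own sketch, not the actual CPython code):

```python
# My reading of the CPython macro USABLE_FRACTION(n) == (n << 1) / 3,
# translated to Python (integer division, as in C).
def usable_fraction(n_slots):
    return (n_slots << 1) // 3

assert usable_fraction(8) == 5    # an 8-slot table holds 5 entries
assert usable_fraction(16) == 10  # ... and a 16-slot table holds 10

# So the 6th insertion into a fresh dict should be what triggers the
# first resize, on the versions where this macro is unchanged.
```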
My tests, therefore, first check `sys.getsizeof` for a few specific example dictionaries: an empty one, ones with 5 elements (as many as will fit before the first resize), and ones with 10 elements (as many as should fit before the second resize - keep reading). They also try both strings (single characters 'a', 'b', etc.) and numbers (small integers 0, 1, etc.) as keys.
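In simplified form, the size checks amount to something like this (a sketch of what the test code does, not the exact code from the repo):

```python
# Simplified sketch of the size checks: compare sys.getsizeof for an
# empty dict and for 5- and 10-element dicts, with both string and
# small-integer keys.
import string
import sys

def build(keys, n):
    d = {}
    for key in keys[:n]:    # add the keys one at a time
        d[key] = None
    return d

str_keys = list(string.ascii_lowercase)   # 'a', 'b', 'c', ...
int_keys = list(range(26))                # 0, 1, 2, ...

for label, keys in (('str', str_keys), ('int', int_keys)):
    print(label, [sys.getsizeof(build(keys, n)) for n in (0, 5, 10)])
```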
As some general notes: the results I show in the README on GitHub are from 64-bit builds. I am, for the most part, not considering shared-key dictionaries here.
Observations and questions (in bold), part 1:
- The implementation seems to change at least a little bit in most minor versions.
- (Not reflected in the test code) It seems to make no difference whether keys are added one at a time or all at once (e.g. by `dict.fromkeys`). This is a little strange to me, because from what I could understand of the discussion around `GROWTH_RATE`, this is supposed to be explicitly considered. (Otherwise: why not just count the non-dummy entries, and choose the power-of-two size that puts the density somewhere between 1/3 and 2/3? See the first sketch after this list.)
- (Not reflected in the test code) Python 3.5 seems to resize the table in the normal way, from 8 slots to 16; but a 16-slot table will hold 11 elements, rather than the 10 (from the `USABLE_FRACTION` macro rounding down) that it holds in all other versions. **What happened?**
- Python 2.7 seems to resize the table by a factor of 4 instead of doubling it. Some of what I could find about `GROWTH_RATE` seems to imply that the table gets resized twice in a row; but I couldn’t understand how the calculation of `GROWTH_RATE` really works, nor how it’s applied in order to resize the table. Anyway, this of course means that tables created with up to 21 keys (assuming no deletions) will be the same size. **Was this intentional? If so, why was it changed back?**
- But even weirder than that: the 2.7 dict size jumps from 280 bytes to 1048. In other words, 768 more bytes are taken up - enough for 32 more table slots (in the old scheme). This is bizarre to me - it’s as if the original 8 slots are held onto, and wasted. **Why?** This doesn’t happen with later resizes: e.g. the next jump in size is from 1048 to 3320 (96 more slots, going from 32 to 128 - another quadrupling).
- In 3.8, empty dicts finally stop taking as much space as 5-element dicts. It appears that they not only save the 128 bytes needed for 5 “entries” plus 8 “indices”, but in fact drop the entire `PyDictKeysObject` and leave the `PyDictObject` with a null pointer. **Was using the space a conscious decision before, or just an oversight?**
- Compared to Python 3.8, non-empty dicts in 3.6 were 8 bytes larger, and in 3.7 16 bytes larger. **Why? Is this extra space in the `PyDictObject`, or in the `PyDictKeysObject` (or both in 3.7)?** I can’t check code versions this old on GitHub, of course.
- Starting in 3.11, we see an optimization for dicts whose keys are all strings. From what I can tell in the source code: a different entry type is used which omits the hash, and the implementation instead relies on a cached hash value in the string objects (see the second sketch after this list). **Is that right?** Also: I thought I had heard about dicts being optimized for string keys in much older versions of Python. **Did I just imagine that?** Maybe I had it confused with the key-sharing optimization (since both of these would be relevant to the `__dict__` of instances of the same class).
- Also in 3.11, the fixed-size portion of `PyDictKeysObject` becomes 8 bytes smaller, e.g. by taking advantage of the fact that the table size will be a power of two (and thus storing a smaller table-size number in a smaller type). Nice. It also looks like `PyDictKeysObject` gained a 4-byte `dk_version` which seems intended to replace the 8-byte `ma_version_tag` in `PyDictObject` (deprecated in 3.12). **Is this used for detecting modifications during iteration?** I couldn’t understand the description in the source comments about how the field is updated. **Or else, just what is it for?**
- It seems that the combined overhead of the miscellaneous “bookkeeping” bits - everything aside from the actual table entries (and the table indices, for the new implementation) - was less before: 88 bytes in 2.7, 96 in 3.5, and 104 bytes today (and as much as 120 bytes in 3.7). **What sorts of things changed? In particular, was the `PyDictKeysObject` data inlined before, or something?** (The shared-key optimization seems to exist in 3.5, but not in 2.7. I think that optimization requires `PyDictKeysObject`s to exist separately, but I’m not sure.)
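To spell out the hypothetical sizing rule from the first item above (to be clear, this is my own strawman, not anything CPython actually does): given the number of live entries, pick the smallest power-of-two table whose 2/3 usable limit isn’t exceeded.

```python
# Hypothetical sizing rule (my strawman, not CPython's): given the number
# of live, non-dummy entries, pick the smallest power-of-two table size
# whose 2/3 usable limit is not exceeded; the resulting density is then
# roughly between 1/3 and 2/3 (except at the minimum size).
def hypothetical_table_size(used, minsize=8):
    size = minsize
    while used > (size << 1) // 3:   # over the usable limit -> double
        size <<= 1
    return size

assert hypothetical_table_size(5) == 8    # 5 live entries still fit in 8 slots
assert hypothetical_table_size(6) == 16   # 6 would exceed 2/3 of 8
assert hypothetical_table_size(10) == 16
assert hypothetical_table_size(11) == 32
```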
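And here is my conceptual picture of the 3.11 string-key entry layout versus the general one, written as Python stand-ins for the C structs (the field names and sizes are my reading of the 3.11 source, so treat them as tentative):

```python
# Conceptual stand-ins for the two entry layouts I think I'm seeing in the
# 3.11 source (my reading of it, not authoritative).
from typing import Any, NamedTuple

class GeneralEntry(NamedTuple):   # cf. PyDictKeyEntry: 3 x 8 bytes = 24
    me_hash: int
    me_key: Any
    me_value: Any

class UnicodeEntry(NamedTuple):   # cf. PyDictUnicodeEntry: 2 x 8 bytes = 16
    me_key: str                   # the hash is re-read from the str object's
    me_value: Any                 # cached hash field when probing
```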
In the second part, I try something more involved, but which gives a much simpler diagnostic: I create a dict with 5 keys, delete some number of them, then attempt to add a key, and see if the dict gets bigger. Generally, deleting a key from a dict needs to leave behind a “dummy” in the table (so that keys which were inserted later and had a hash collision can still be found - the dummy signals the probe to keep looking). That means that, in general, we still have to rehash even if we nominally “made room”.
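In simplified form, the diagnostic is roughly this (again a sketch rather than the exact repo code, parameterised by how many keys are deleted before the new insertion):

```python
# Simplified sketch of the part-2 diagnostic: build a 5-key dict, delete
# `n_deleted` keys, add one new key, and see whether sys.getsizeof
# reports a bigger table afterwards.
import sys

def grows_after_delete_then_add(n_deleted):
    d = dict.fromkeys('abcde')
    before = sys.getsizeof(d)
    for key in 'abcde'[:n_deleted]:
        del d[key]
    d['f'] = None                 # the insertion that may force a resize
    return sys.getsizeof(d) > before

for n in range(6):
    print(n, grows_after_delete_then_add(n))
```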
- In 3.5, the behaviour is not consistent between runs. What’s going on, exactly? It seems like there’s some attempt to check whether the new key could replace a dummy before deciding to resize. If that’s it, why don’t other versions seem to do this too? I also remember something about the hash function having some kind of random seed, to avoid denial-of-service attacks (by providing a large number of keys in input data that would be known to collide). Is that relevant here? (See the sketch after this list.)
- In 3.6 through 3.9, if we remove at least three keys before adding a new one, the dictionary consistently ends up taking the same amount of space as before. My guess is: the rehash happens regardless, but since dummies aren’t included when rehashing (there’s no reason to), the number of entries being hashed is now small enough that Python decides to allocate the same amount of space as before (instead of doubling the size). Is that right?
- Starting in 3.10, the old 2.7 behaviour appears to come back: unless the table is completely emptied, adding a new key will force a rehash and a size increase. Is there a good reason for this, or should it be considered a regression?
- Speaking of which: is emptying a dict completely special-cased somehow? In my tests, I `del` each key one at a time; so if there’s a different code path for empty dicts (rather, ones where every table entry is a dummy), it needs to actually detect or track that somehow - it isn’t something intrinsic to `.clear()`, for example.
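Regarding the hash randomization I mentioned in the 3.5 item: string hashes really do differ between interpreter runs unless PYTHONHASHSEED is pinned, which is easy to demonstrate from the outside (whether that’s actually what makes my 3.5 results inconsistent is exactly what I’m asking):

```python
# Quick demonstration of per-run hash randomization for str keys: run the
# same hash() call in fresh interpreters, with and without a pinned seed.
# (Small integers, by contrast, hash to themselves, so int-keyed tables
# lay out the same way on every run.)
import os
import subprocess
import sys

cmd = [sys.executable, '-c', "print(hash('a'))"]

def run_three(seed):
    env = dict(os.environ, PYTHONHASHSEED=seed)
    return [subprocess.run(cmd, capture_output=True, text=True, env=env)
            .stdout.strip() for _ in range(3)]

print(run_three('random'))   # normally three different values
print(run_three('0'))        # three identical values
```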