Memory-Efficient Dictionary via Elastic Hashing (Achieving 40%+ RAM savings at high load factors)

Introduction

Hi everyone,

I’ve been exploring the practical implications of the recent paper “Optimal Bounds for Open Addressing Without Reordering” (Farach-Colton et al., arXiv:2501.02305, Jan 2025). I’ve developed a Proof-of-Concept C-extension for Python called DenseDict that implements Elastic Hashing to address the memory overhead inherent in CPython’s power-of-two resizing strategy.

The Problem

Currently, CPython’s dict doubles its capacity when it reaches a load factor of ≈66%. This leads to significant memory waste immediately after a resize, where up to 50% of allocated slots remain empty. Furthermore, these geometric resizes cause latency spikes (“stop-the-world” copies) that can be problematic for real-time or large-scale applications.

The Proposal: Elastic (Funnel) Hashing

DenseDict uses a multi-level funnel strategy with Lazy Level Allocation. Instead of one giant table that doubles in size, it uses geometrically shrinking levels (50%, 25%, 12.5%, etc.) and a dynamic probe limit per level.

Key Advantages:

  • High Load Factor: Operates comfortably at 85-95% load without search degradation.

  • No Reordering: Keys never move once placed, maintaining pointer stability (vital for CPython internals).

  • Predictable Insertion: Eliminates the massive latency spikes associated with doubling the table size.
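For readers who have not seen the paper, here is a minimal, purely illustrative Python sketch of the multi-level idea (geometrically shrinking levels, a bounded probe budget per level, overflow to the next level). The class name, sizes, and probe limit are all made up for the example; this is not the DenseDict C implementation:

```python
class FunnelTable:
    """Toy illustration of a multi-level funnel table.

    Levels shrink geometrically (50%, 25%, ... of a nominal capacity),
    each level gets a bounded probe budget, and an insert that exhausts
    its budget on one level overflows to the next. Keys never move once
    placed.
    """

    def __init__(self, capacity=1024, probe_limit=8):
        self.levels = []
        size = capacity // 2
        while size >= 8:                  # shrinking levels: 50%, 25%, 12.5%, ...
            self.levels.append([None] * size)
            size //= 2
        self.probe_limit = probe_limit

    def insert(self, key, value):
        for table in self.levels:
            start = hash(key) % len(table)
            for step in range(self.probe_limit):
                slot = (start + step) % len(table)
                if table[slot] is None or table[slot][0] == key:
                    table[slot] = (key, value)
                    return True
        return False                      # every level's budget exhausted

    def lookup(self, key):
        for table in self.levels:
            start = hash(key) % len(table)
            for step in range(self.probe_limit):
                slot = (start + step) % len(table)
                if table[slot] is None:
                    break                 # nothing here; try the next level
                if table[slot][0] == key:
                    return table[slot][1]
        raise KeyError(key)
```

With these defaults that is 1016 slots across seven levels. Real elastic hashing derives each level's probe limit from the load factor rather than using a fixed budget, which this sketch does not attempt.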

Benchmark Results (900K Items)

I’ve conducted extensive benchmarks comparing DenseDict v2 against the built-in dict.

| Metric | DenseDict v2 | CPython dict | Improvement |
|---|---|---|---|
| Memory (Integer Keys) | 23.25 MB | 40.00 MB | 41.9% Savings |
| Memory (String Keys) | 23.25 MB | 29.33 MB | 20.7% Savings |
| Avg. Levels/Lookup | 1.648 | N/A | High Density |
| Insert Degradation | +5.6% | +20.1% | More Stable |
| Lookup Performance | 4.27M ops/s | 4.31M ops/s | ~1.1% Difference |

Note: Lookup performance is within 1% of CPython’s highly optimized implementation, while using significantly less RAM.

Interactive Proof-of-Concept

I have prepared a Google Colab and a GitHub Repository where these results can be reproduced:

Discussion

I believe this approach could be particularly beneficial for:

  1. FrozenDicts / Immutable mappings: Where high density is a priority.

  2. Memory-constrained environments: (e.g., embedded Python or micro-containers).

  3. Large-scale caches: Where the 20-40% memory saving translates to gigabytes of reduced overhead.

I’d love to hear the thoughts of the community and the dictobject.c maintainers on whether a multi-level funnel approach could be integrated as a specialized internal representation for high-density dictionaries.

Best regards

Jesus B/bueno12223
6 Likes

Just pointing out that no one would ever get “40% RAM savings”. The data (which is the dominant cost) is external to the dictionary object. The dictionary itself only has pointers to the data. The rule of thumb is that pointers are cheap and objects are expensive.

DenseDict uses a multi-level funnel strategy

Does that preserve insertion order and does it allow for instances to have key-sharing, or would we lose those?

1 Like

You’re right that Python objects live externally. However, the structural overhead of the hash table becomes a bottleneck in memory-bound systems with millions of keys. My goal with DenseDict isn’t just to save RAM on the objects themselves, but to optimize the control structure that points to them.

Regarding order and key-sharing: this PoC explores the theoretical limits of density. Integrating it into CPython would indeed require a hybrid approach to maintain PEP 412/468 compatibility, likely using Elastic Hashing as a high-density index provider.

# Download densedict.c from GitHub (update URL to your repo)

...

Was this (partially) made by AI?

There seem to be a lot of emojis too. It’s nothing bad in general to use AI (although for coding, especially at a critical level like C, it’s less smart), but it would have been nice if you had informed us. Also, rereading your initial proposal as well as your answer, it seems like you used LLMs for those too.

The same applies to the code: no one uses the special … character, or the long —, nor would anyone structure their code in paragraphs, or use Epsilon and a subscript 2 as the base for log instead of writing log2 or E.

In case I’m wrong, and your code and proposal were not written (or heavily aided) by AI, I would be very sorry for saying so.

3 Likes

Which leaves me wondering if the memory savings and performance figures were also generated by an LLM and not by testing.

3 Likes

I don’t think you should be sorry even if you are incorrect. If a post looks like LLM usage, then the OP should be informed why it looks like that. It matters.

Also, you were kind in saying there’s “nothing wrong” with using these tools, but that gentleness may not be what’s needed in this case.
Setting aside the ethical questions (haha, is that even possible?), it’s important to be clear that using these tools without disclosure is unfair to your readers.

If it’s not LLM generated, I think the OP needs to explain how they did their benchmarking and why the repo appears fully formed in only two commits (which is not an organic way for a project to develop in git).


Because there are currently questions about the provenance of the code and veracity of the performance claims, I’ll be waiting to learn more before devoting any serious attention to the content in this case.

4 Likes

That’s unclear. While CPython never moves the memory allocated for a Python object (which is indeed vital for its internals), to containers (like dicts) a “Python object” is just a C pointer. It doesn’t matter at all if these pointers are moved around in memory, or even duplicated a million times. They’re just native C pointers. They point to memory at which the Python object’s data lives, and that’s what never moves (in CPython - other implementations differ).

An advantage of resizing is that all the pointers to keys are copied to a new, contiguous, memory space. No gaps. This speeds later iteration, which is an important feature of “ordered dicts”.

So at times I’ve wanted a new, say, dict.compress() method, to use after I’m done building a giant dict, to force it to resize to the smallest size compatible with the max 2/3rds load factor.

Other than that, there’s nothing objectionable about the abstract idea, and other parts of CPython embrace simple versions of it. Lists, for example, do not double in size, but follow a funky, uneven growth pattern, driven by lots of staring at actual growth patterns in real applications. Sets differ too, but use a lower load factor than dicts.
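The uneven list growth is easy to observe from Python. The exact sizes vary by CPython version and platform, so this only shows the shape of the pattern:

```python
import sys

# Watch a list's allocated size as it grows: the jumps are uneven,
# not simple doubling.
sizes = []
lst = []
for _ in range(64):
    lst.append(None)
    sizes.append(sys.getsizeof(lst))

# Distinct allocation sizes hit along the way:
print(sorted(set(sizes)))
```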

I don’t think that was done for dicts, and it’s at best “informed guesses” in any case.

The devil is in the details, though, and the referenced paper is too involved for an “elevator pitch”. It could help if you gave a concise but accurate summary of what “a multi-level funnel strategy” means, exactly. We can already guess that if keys (pointers!) are never moved, we’ll lose the advantage mentioned above (that on a resize, all gaps between keys are squashed out of existence).

9 Likes

And I forgot there’s “a reason” it wasn’t done for dicts: the probe sequence relies on that the number of slots (SIZE) is a power of 2. The initial slot looked at for a key with hash code H is the blazing fast H & MASK, where MASK == SIZE - 1 is invariant so long as the number of slots doesn’t change. That is, it simply takes the trailing k bits of H for a fixed-per-SIZE k.

Countering clustering isn’t done by making the initial slot picked “smart”, but by making the subsequent probe sequence vary wildly (pseudo-randomly) and depending on all the bits in H.

But we usually win on the first try, so that’s the one to optimize. Indeed, for a dict indexed by a contiguous range of ints (excluding -1), there are never collisions - “win on the first try” is always the case. That’s an important use case too.
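For the curious, the scheme just described can be sketched in a few lines of Python. This is a simplified model of the lookup loop in dictobject.c (where PERTURB_SHIFT is 5), not the actual C code:

```python
def probe_sequence(h, size, max_probes):
    """Yield the slot indices examined for hash h in a table of `size`
    slots (a power of 2), following the recurrence described above."""
    mask = size - 1
    perturb = h
    i = h & mask                  # first probe: trailing bits of h only
    for _ in range(max_probes):
        yield i
        perturb >>= 5             # fold higher-order bits of h into the walk
        i = (5 * i + perturb + 1) & mask

print(list(probe_sequence(17, 8, 5)))   # [1, 6, 7, 4, 5]
```

Note that once `perturb` has shifted down to zero, the recurrence degenerates to the pure `(5*i + 1) & mask` LCG, which still visits every slot before repeating.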

3 Likes

Thank you all for the honest feedback and for pointing this out. I want to be completely transparent with the community: Yes, this Proof-of-Concept was developed with the assistance of an AI (LLM)

I used AI to help me quickly prototype the C extension and structure the proposal. My primary goal was a rapid exploration of whether the Elastic Hashing paper (arXiv:2501.02305) could actually be applied to improve CPython’s memory efficiency.

It was a “fast-track” experiment to see if the hypothesis works.

However, the benchmarks, memory figures, and performance graphs are NOT AI-generated.

These results come directly from running the code in controlled environments. I invite you to:

  1. Clone the repository and run the tests locally.

  2. Execute the Google Colab notebook I provided.

The Colab compiles the C code from scratch and generates those exact plots based on the live execution.

This is my first time exploring this level of optimization within the Python ecosystem, and I apologize for not disclosing the AI assistance earlier. I will update the project details to reflect that this is an AI-aided optimization experiment :slight_smile:

2 Likes

And were your statistics LLM-generated too?

No.
I invite you to run the Colab or run it on your local machine.
Let me know if the results are different or you see any fundamental error in the tests.

I understand why the ‘N/A’ might look suspicious, but it represents a fundamental architectural difference, not a lack of data.

CPython uses a single-level hash table, so its ‘level’ count is effectively always 1. In DenseDict, we use a multi-level funnel where keys overflow. The 1.648 figure is a real metric tracked during the execution of the Colab benchmark.

Those figures and graphs are calculated by the script using sys.getsizeof for CPython and a custom memory_usage() method for the C extension that counts the allocated bytes for each level’s entries.
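For context, the CPython side of such a measurement amounts to the following. Note that sys.getsizeof counts only the dict’s own structure (index table plus entries), not the key and value objects it points to, which is exactly the “pointers are cheap, objects are expensive” point raised above:

```python
import sys

# Build a 900K-entry dict and measure only its own structure: the
# index table and entry triples, NOT the int objects the keys point to.
d = {i: None for i in range(900_000)}
structure_bytes = sys.getsizeof(d)
print(f"dict structure: {structure_bytes / 2**20:.2f} MB")
```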

What is the legal position on allowing LLM-generated code into the CPython code base? I’ve heard concerning suggestions that once LLM-generated code enters a codebase, the copyright and licensing position of the codebase becomes questionable. Obviously, it’s a rapidly changing area and there won’t be any precedents to base an opinion on yet, but is it something we want to take a risk on?

8 Likes

A big caution: from the Colab page:

We test both string keys (uniform distribution) and integer keys (trivial hash — stress test for collision handling).

But the ints used actually appear to be a best case (0 collisions) for your scheme:

```python
keys_int = list(range(N))
```

and from the C extension:

```c
size_t idx = hash % lv->capacity;
```

... and later, in the loop:

```c
    // Continue probing
    idx = (idx + 1) % lv->capacity;
```

Consecutive integers never collide under dirt-simple linear probing.

Consider instead keys like

```python
offset = 42  # arbitrary int < capacity
keys_int = list(range(offset, offset + N * capacity, capacity))
```

The probe sequences of those are all identical: offset, offset + 1, offset + 2, …, a common “primary clustering” quadratic-time disaster mode for linear probing.
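The difference is easy to demonstrate with a toy linear-probing table. This is a simplified stand-in for the probing quoted above, not the actual extension code:

```python
def insert_linear(table, key):
    """Insert key with plain linear probing; return the number of probes."""
    cap = len(table)
    idx = key % cap
    probes = 1
    while table[idx] is not None:
        idx = (idx + 1) % cap     # "just add 1" probing
        probes += 1
    table[idx] = key
    return probes

cap, n = 1024, 512

# Consecutive ints: a best case -- every key lands on its first probe.
t = [None] * cap
best = sum(insert_linear(t, k) for k in range(n))

# Stride == capacity: every key starts at the same slot (primary clustering).
t = [None] * cap
worst = sum(insert_linear(t, 42 + i * cap) for i in range(n))

print(best, worst)   # 512 vs. 131328 total probes
```

The i-th colliding insert walks past the i keys already piled up, so total work grows quadratically: 1 + 2 + … + 512 = 131328 probes to insert 512 keys.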

CPython’s implementation is also collision-free for your original int keys. The reason your scheme’s lookup takes about twice as long overall has to do with the fact that your initial probe index is obtained via division (mod) rather than CPython’s much faster “just take the last k bits”. That’s the first and the last index looked at under both schemes for consecutive int inputs.

Depending on “offset”, the nastier arithmetic progressions can also create collisions under CPython’s probe sequence, but they’re hard to exploit. The probe sequence is far from “just add 1”. For very large “offset”, the first dozen (or so) probes can coincide, but typically no worse than that. Higher-order bits of the hash are folded in as the probe sequence advances, and that’s key to breaking up chains of long probe collisions - the initial index looked at is just the start of the story. In linear probing, the initial index looked at determines the entire probe sequence.

If you like using AI bots, ask them for more explanation :wink: I know, e.g., that ChatGPT-5 understands CPython’s probe sequence far better than most people do. It looks very simple, but performs better than far fancier schemes tried in the past.

6 Likes

I asked Copilot to help me write up an explanation of CPython’s dict probing strategy.
I wrote the original implementation, and confirm this captures the real mechanics better than most human explanations I’ve seen.

Tools can’t replace understanding — but they help flesh it out and express it. Here’s the explanation Copilot produced:

2 Likes

That’s a lie, isn’t it?

For example if you have 16 slots, then starting at 1 and repeatedly multiplying by 5 (modulo 16) gets you 1->5->9->13->1. Visits only a quarter of all slots.

2 Likes

Yes, but a “lie to children” - I told it this was for a non-specialist audience, and not to get bogged down in technical details. If you care, consult the Hull-Dobell Theorem for necessary and sufficient conditions. For a modulus that’s a power of 2, they reduce to that the multiplier be congruent to 1 mod 4, and that the additive constant not be 0.

You’re missing something, but I’m not sure what. As the bot said, the actual recurrence Python uses is:

i = (i * 5 + 1) & mask

For whatever reason, you left off the “+ 1” part. Put it back in, and it achieves full period: 1, 6, 15, 12, 13, 2, 11, 8, 9, 14, 7, 4, 5, 10, 3, 0 (and then it repeats).
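Anyone curious can check the full period in a couple of lines:

```python
mask = 15                 # 16 slots
i, seen = 1, []
while i not in seen:      # walk the recurrence until it cycles
    seen.append(i)
    i = (5 * i + 1) & mask

print(seen)               # all 16 slots visited before repeating
```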

The bot offered to show why multipliers of 3 and 7 don’t work (not congruent to 1 mod 4), but I told it “na, too detailed for what I’m after here”.

3 Likes

Note: no bot played any part in this reply. But I’m adopting its bullet-point style anyway, for clarity.

  • Probe strategy.
    • Linear probing is a pragmatic nightmare - far too easy to fall into quadratic-time cases.
    • CPython’s probe strategy was refined over years, and has no known “real life” very bad cases. Of course it’s possible to contrive bad sets of keys, but they’re exceedingly unlikely to show up in real life.
    • But it relies on having a power of 2 number of slots. Which also allows replacing expensive int mod with a 1-cycle bit mask.

So move to CPython’s probe strategy for each “level” (whatever that means, exactly). I believe it’s orthogonal to what you’re really after (multiple levels, not probe strategy).

  • Number of collisions is interesting and worth minimizing, but not primary. What matters much more is “pileup”: the longest collision chain encountered. Open addressing schemes that allow for high load factors are especially vulnerable to getting embarrassed by high pileup confined to a relatively small number of keys that just happen to be accessed especially frequently. CPython’s probe sequence is also resistant to high pileup. If, e.g., the average number of probes turns out to be 3, it barely matters at all if the longest collision chain is of length 5. But if the average number of probes turns out to be 2, it’s a potential disaster if any key has pileup in the thousands.

  • You need to say more at a higher level.

    • For example, do dicts have a maximum size under this scheme? Or can they grow to an unbounded number of “levels”? Massive dicts do show up in real life, and “just work” now (until you run out of RAM).
    • On the flip side, do dicts have a minimum size under this scheme? A large number of very small dicts is also common. If dicts “start life small” (as they do now), I’m not picturing how your scheme deals with that some of them may eventually grow to consume all of RAM :wink:
  • Something to note about the current scheme: on a resize, as mentioned before, the keys (pointers) are copied into a contiguous memory space. While that space shows up as “RAM consumed”, that’s pessimistic: it can very well be just an “address space reservation”, not consuming any actual HW resources unless and until the OS materializes a newly touched page of RAM. That is, the “empty space” after a doubling resize may be more a technical accounting quirk than something with actual consequences (although Windows does not allow mere address space reservations to exceed the amount of RAM + page file, even if the address space reserved is never actually used).

3 Likes

Contributions come from people with their name and their CLA signature and thus their responsibility attached. We have a stated policy up in Generative AI.

You cannot get a legal opinion here.

3 Likes

Because the bot said it’s the multiplying by 5 that guarantees the probe sequence will visit every slot before repeating. I just took it at its word.

2 Likes

Meh. I asked it not to get bogged down in technical details, and it complied. “Good enough” for purpose. Picking individual statements out of context is certain to expose “flaws”. Context you omitted:

  • It explicitly gave the full recurrence used, which does achieve full period.
  • It explicitly pointed out that the additive constant must be non-zero, else 0 would be a fixed point (a more succinct and cleaner proof of that i = 5*i can’t work than your exhaustive enumeration of what happens then).

But there’s a different reason for why 5 was picked, which the bot didn’t know. The multiplier must be congruent to 1 mod 4 to achieve full period. 1 isn’t suitable, because it would degenerate into linear probing then. The next smallest that could work is 5, and at the time the state of major HW and compilers was such that 5*i generally ran slower than (i << 2) + i. And the code was originally written the latter way for that reason.

So that’s almost the full story: Hull-Dobell plus “go fast” pragmatics. The remaining part is “why 1, when any non-zero additive constant would work?”. The answer is simple but boring: “why not 1? there’s no reason to imagine that any non-zero constant would work better than any other”.

But all of that elaboration is too deep in the weeds for what I was after.

2 Likes