Memory-Efficient Dictionary via Elastic Hashing (Achieving 40%+ RAM savings at high load factors)

Introduction

Hi everyone,

I’ve been exploring the practical implications of the recent paper “Optimal Bounds for Open Addressing Without Reordering” (Farach-Colton et al., arXiv:2501.02305, Jan 2025). I’ve developed a Proof-of-Concept C-extension for Python called DenseDict that implements Elastic Hashing to address the memory overhead inherent in CPython’s power-of-two resizing strategy.

The Problem

Currently, CPython’s dict doubles its capacity when it reaches a load factor of ≈66%. This leads to significant memory waste immediately after a resize, where up to 50% of allocated slots remain empty. Furthermore, these geometric resizes cause latency spikes (“stop-the-world” copies) that can be problematic for real-time or large-scale applications.

The Proposal: Elastic (Funnel) Hashing

DenseDict uses a multi-level funnel strategy with Lazy Level Allocation. Instead of one giant table that doubles in size, it uses geometrically shrinking levels (50%, 25%, 12.5%, etc.) and a dynamic probe limit per level.

Key Advantages:

  • High Load Factor: Operates comfortably at 85-95% load without search degradation.

  • No Reordering: Keys never move once placed, maintaining pointer stability (vital for CPython internals).

  • Predictable Insertion: Eliminates the massive latency spikes associated with doubling the table size.
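For readers who have not seen the paper, here is a minimal, purely illustrative Python sketch of the multi-level idea (geometrically shrinking levels, a bounded probe budget per level, overflow to the next level). The class name, sizes, and probe limit are all made up for the example; this is not the DenseDict C implementation:

```python
class FunnelTable:
    """Toy illustration of a multi-level funnel table.

    Levels shrink geometrically (50%, 25%, ... of a nominal capacity),
    each level gets a bounded probe budget, and an insert that exhausts
    its budget on one level overflows to the next. Keys never move once
    placed.
    """

    def __init__(self, capacity=1024, probe_limit=8):
        self.levels = []
        size = capacity // 2
        while size >= 8:                  # shrinking levels: 50%, 25%, 12.5%, ...
            self.levels.append([None] * size)
            size //= 2
        self.probe_limit = probe_limit

    def insert(self, key, value):
        for table in self.levels:
            start = hash(key) % len(table)
            for step in range(self.probe_limit):
                slot = (start + step) % len(table)
                if table[slot] is None or table[slot][0] == key:
                    table[slot] = (key, value)
                    return True
        return False                      # every level's budget exhausted

    def lookup(self, key):
        for table in self.levels:
            start = hash(key) % len(table)
            for step in range(self.probe_limit):
                slot = (start + step) % len(table)
                if table[slot] is None:
                    break                 # nothing here; try the next level
                if table[slot][0] == key:
                    return table[slot][1]
        raise KeyError(key)
```

With these defaults that is 1016 slots across seven levels. Real elastic hashing derives each level's probe limit from the load factor rather than using a fixed budget, which this sketch does not attempt.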

Benchmark Results (900K Items)

I’ve conducted extensive benchmarks comparing DenseDict v2 against the built-in dict.

| Metric | DenseDict v2 | CPython dict | Improvement |
|---|---|---|---|
| Memory (Integer Keys) | 23.25 MB | 40.00 MB | 41.9% Savings |
| Memory (String Keys) | 23.25 MB | 29.33 MB | 20.7% Savings |
| Avg. Levels/Lookup | 1.648 | N/A | High Density |
| Insert Degradation | +5.6% | +20.1% | More Stable |
| Lookup Performance | 4.27M ops/s | 4.31M ops/s | ~1.1% Difference |

Note: Lookup performance is within 1% of CPython’s highly optimized implementation, while using significantly less RAM.

Interactive Proof-of-Concept

I have prepared a Google Colab and a GitHub Repository where these results can be reproduced:

Discussion

I believe this approach could be particularly beneficial for:

  1. FrozenDicts / Immutable mappings: Where high density is a priority.

  2. Memory-constrained environments: (e.g., embedded Python or micro-containers).

  3. Large-scale caches: Where the 20-40% memory saving translates to gigabytes of reduced overhead.

I’d love to hear the thoughts of the community and the dictobject.c maintainers on whether a multi-level funnel approach could be integrated as a specialized internal representation for high-density dictionaries.

Best regards

Jesus B/bueno12223
6 Likes

Just pointing out that no one would ever get “40% RAM savings”. The data (which is the dominant cost) is external to the dictionary object. The dictionary itself only has pointers to the data. The rule of thumb is that pointers are cheap and objects are expensive.

DenseDict uses a multi-level funnel strategy

Does that preserve insertion order and does it allow for instances to have key-sharing, or would we lose those?

1 Like

You’re right that Python objects live externally. However, the structural overhead of the hash table becomes a bottleneck in memory-bound systems with millions of keys. My goal with DenseDict isn’t just to save RAM on the objects themselves, but to optimize the control structure that points to them.

Regarding order and key-sharing: this PoC explores the theoretical limits of density. Integrating it into CPython would indeed require a hybrid approach to maintain PEP 412/468 compatibility, likely using Elastic Hashing as a high-density index provider.

# Download densedict.c from GitHub (update URL to your repo)

...

Was this (partially) made by AI?

There seem to be a lot of emojis too. It’s nothing bad in general to use AI (although for coding, especially at a critical level like C, it’s less smart), but it would have been nice if you had informed us. Also, rereading your initial proposal as well as your answer, it seems like you used LLMs for those too.

The same applies to the code: no one uses the special … character, or the long —, nor would anyone structure their code in paragraphs, or use Epsilon and a subscript 2 as the base for log instead of writing log2 or E.

In case I’m wrong, and your code and proposal were not written (or heavily aided) by AI, I would be very sorry for saying so.

3 Likes

Which leaves me wondering if the memory savings and performance figures were also generated by an LLM and not by testing.

3 Likes

I don’t think you should be sorry even if you are incorrect. If a post looks like LLM usage, then the OP should be informed why it looks like that. It matters.

Also, you were kind in saying there’s “nothing wrong” with using these tools, but that gentleness may not be what’s needed in this case.
Setting aside the ethical questions (haha, is that even possible?), it’s important to be clear that using these tools without disclosure is unfair to your readers.

If it’s not LLM generated, I think the OP needs to explain how they did their benchmarking and why the repo appears fully formed in only two commits (which is not an organic way for a project to develop in git).


Because there are currently questions about the provenance of the code and veracity of the performance claims, I’ll be waiting to learn more before devoting any serious attention to the content in this case.

4 Likes

That’s unclear. While CPython never moves the memory allocated for a Python object (which is indeed vital for its internals), to containers (like dicts) a “Python object” is just a C pointer. It doesn’t matter at all if these pointers are moved around in memory, or even duplicated a million times. They’re just native C pointers. They point to memory at which the Python object’s data lives, and that’s what never moves (in CPython - other implementations differ).

An advantage of resizing is that all the pointers to keys are copied to a new, contiguous, memory space. No gaps. This speeds later iteration, which is an important feature of “ordered dicts”.

So at times I’ve wanted a new, say, dict.compress() method, to use after I’m done building a giant dict, to force it to resize to the smallest size compatible with the max 2/3rds load factor.

Other than that, there’s nothing objectionable about the abstract idea, and other parts of CPython embrace simple versions of it. Lists, for example, do not double in size, but follow a funky, uneven growth pattern, driven by lots of staring at actual growth patterns in real applications. Sets differ too, but use a lower load factor than dicts.
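The uneven list growth is easy to observe from Python. The exact sizes vary by CPython version and platform, so this only shows the shape of the pattern:

```python
import sys

# Watch a list's allocated size as it grows: the jumps are uneven,
# not simple doubling.
sizes = []
lst = []
for _ in range(64):
    lst.append(None)
    sizes.append(sys.getsizeof(lst))

# Distinct allocation sizes hit along the way:
print(sorted(set(sizes)))
```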

I don’t think that was done for dicts, and it’s at best “informed guesses” in any case.

The devil is in the details, though, and the referenced paper is too involved for an “elevator pitch”. It could help if you gave a concise but accurate summary of what “a multi-level funnel strategy” means, exactly. We can already guess that if keys (pointers!) are never moved, we’ll lose the advantage mentioned above (that on a resize, all gaps between keys are squashed out of existence).

9 Likes

And I forgot there’s “a reason” it wasn’t done for dicts: the probe sequence relies on that the number of slots (SIZE) is a power of 2. The initial slot looked at for a key with hash code H is the blazing fast H & MASK, where MASK == SIZE - 1 is invariant so long as the number of slots doesn’t change. That is, it simply takes the trailing k bits of H for a fixed-per-SIZE k.

Countering clustering isn’t done by making the initial slot picked “smart”, but by making the subsequent probe sequence vary wildly (pseudo-randomly) and depending on all the bits in H.

But we usually win on the first try, so that’s the one to optimize. Indeed, for a dict indexed by a contiguous range of ints (excluding -1), there are never collisions - “win on the first try” is always the case. That’s an important use case too.
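For the curious, the scheme just described can be sketched in a few lines of Python. This is a simplified model of the lookup loop in dictobject.c (where PERTURB_SHIFT is 5), not the actual C code:

```python
def probe_sequence(h, size, max_probes):
    """Yield the slot indices examined for hash h in a table of `size`
    slots (a power of 2), following the recurrence described above."""
    mask = size - 1
    perturb = h
    i = h & mask                  # first probe: trailing bits of h only
    for _ in range(max_probes):
        yield i
        perturb >>= 5             # fold higher-order bits of h into the walk
        i = (5 * i + perturb + 1) & mask

print(list(probe_sequence(17, 8, 5)))   # [1, 6, 7, 4, 5]
```

Note that once `perturb` has shifted down to zero, the recurrence degenerates to the pure `(5*i + 1) & mask` LCG, which still visits every slot before repeating.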

3 Likes

Thank you all for the honest feedback and for pointing this out. I want to be completely transparent with the community: Yes, this Proof-of-Concept was developed with the assistance of an AI (LLM)

I used AI to help me quickly prototype the C extension and structure the proposal. My primary goal was a rapid exploration of whether the Elastic Hashing paper (arXiv:2501.02305) could actually be applied to improve CPython’s memory efficiency.

It was a “fast-track” experiment to see if the hypothesis works.

However, the benchmarks, memory figures, and performance graphs are NOT AI-generated.

These results come directly from running the code in controlled environments. I invite you to:

  1. Clone the repository and run the tests locally.

  2. Execute the Google Colab notebook I provided.

The Colab compiles the C code from scratch and generates those exact plots based on the live execution.

This is my first time exploring this level of optimization within the Python ecosystem, and I apologize for not disclosing the AI assistance earlier. I will update the project details to reflect that this is an AI-aided optimization experiment :slight_smile:

2 Likes

And were your statistics LLM-generated too?

No.
I invite you to run the Colab or run it on your local machine.
Let me know if the results are different or you see any fundamental error in the tests.

I understand why the ‘N/A’ might look suspicious, but it represents a fundamental architectural difference, not a lack of data.

CPython uses a single-level hash table, so its ‘level’ count is effectively always 1. In DenseDict, we use a multi-level funnel where keys overflow. The 1.648 figure is a real metric tracked during the execution of the Colab benchmark.

Those figures and graphs are calculated by the script using sys.getsizeof for CPython and a custom memory_usage() method for the C extension that counts the allocated bytes for each level’s entries.
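For context, the CPython side of such a measurement amounts to the following. Note that sys.getsizeof counts only the dict’s own structure (index table plus entries), not the key and value objects it points to, which is exactly the “pointers are cheap, objects are expensive” point raised above:

```python
import sys

# Build a 900K-entry dict and measure only its own structure: the
# index table and entry triples, NOT the int objects the keys point to.
d = {i: None for i in range(900_000)}
structure_bytes = sys.getsizeof(d)
print(f"dict structure: {structure_bytes / 2**20:.2f} MB")
```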

What is the legal position on allowing LLM-generated code into the CPython code base? I’ve heard concerning suggestions that once LLM-generated code enters a codebase, the copyright and licensing position of the codebase becomes questionable. Obviously, it’s a rapidly changing area and there won’t be any precedents to base an opinion on yet, but is it something we want to take a risk on?

8 Likes

A big caution: from the Colab page:

We test both string keys (uniform distribution) and integer keys (trivial hash — stress test for collision handling).

But the ints used actually appear to be a best case (0 collisions) for your scheme:

```python
keys_int = list(range(N))
```

and from the C extension:

```c
size_t idx = hash % lv->capacity;
```

... and later, in the loop:

```c
    // Continue probing
    idx = (idx + 1) % lv->capacity;
```

Consecutive integers never collide under dirt-simple linear probing.

Consider instead keys like

```python
offset = 42  # arbitrary int < capacity
keys_int = list(range(offset, offset + N * capacity, capacity))
```

The probe sequences of those are all identical: offset, offset + 1, offset + 2, …, a common “primary clustering” quadratic-time disaster mode for linear probing.
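The difference is easy to demonstrate with a toy linear-probing table. This is a simplified stand-in for the probing quoted above, not the actual extension code:

```python
def insert_linear(table, key):
    """Insert key with plain linear probing; return the number of probes."""
    cap = len(table)
    idx = key % cap
    probes = 1
    while table[idx] is not None:
        idx = (idx + 1) % cap     # "just add 1" probing
        probes += 1
    table[idx] = key
    return probes

cap, n = 1024, 512

# Consecutive ints: a best case -- every key lands on its first probe.
t = [None] * cap
best = sum(insert_linear(t, k) for k in range(n))

# Stride == capacity: every key starts at the same slot (primary clustering).
t = [None] * cap
worst = sum(insert_linear(t, 42 + i * cap) for i in range(n))

print(best, worst)   # 512 vs. 131328 total probes
```

The i-th colliding insert walks past the i keys already piled up, so total work grows quadratically: 1 + 2 + … + 512 = 131328 probes to insert 512 keys.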

CPython’s implementation is also collision-free for your original int keys. The reason your scheme’s lookup takes about twice as long overall has to do with the fact that your initial probe index is obtained via division (mod) rather than CPython’s much faster “just take the last k bits”. That’s the first and the last index looked at under both schemes for consecutive int inputs.

Depending on “offset”, the nastier arithmetic progressions can also create collisions under CPython’s probe sequence, but they’re hard to exploit. The probe sequence is far from “just add 1”. For very large “offset”, the first dozen (or so) probes can coincide, but typically no worse than that. Higher-order bits of the hash are folded in as the probe sequence advances, and that’s key to breaking up chains of long probe collisions - the initial index looked at is just the start of the story. In linear probing, the initial index looked at determines the entire probe sequence.

If you like using AI bots, ask them for more explanation :wink: I know, e.g., that ChatGPT-5 understands CPython’s probe sequence far better than most people do. It looks very simple, but performs better than far fancier schemes tried in the past.

6 Likes

I asked Copilot to help me write up an explanation of CPython’s dict probing strategy.
I wrote the original implementation, and confirm this captures the real mechanics better than most human explanations I’ve seen.

Tools can’t replace understanding — but they help flesh it out and express it. Here’s the explanation Copilot produced:

2 Likes

That’s a lie, isn’t it?

For example if you have 16 slots, then starting at 1 and repeatedly multiplying by 5 (modulo 16) gets you 1->5->9->13->1. Visits only a quarter of all slots.

2 Likes

Yes, but a “lie to children” - I told it this was for a non-specialist audience, and not to get bogged down in technical details. If you care, consult the Hull-Dobell Theorem for necessary and sufficient conditions. For a modulus that’s a power of 2, they reduce to that the multiplier be congruent to 1 mod 4, and that the additive constant not be 0.

You’re missing something, but I’m not sure what. As the bot said, the actual recurrence Python uses is:

i = (i * 5 + 1) & mask

For whatever reason, you left off the “+ 1” part. Put it back in, and it achieves full period: 1, 6, 15, 12, 13, 2, 11, 8, 9, 14, 7, 4, 5, 10, 3, 0 (and then it repeats).
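Anyone curious can check the full period in a couple of lines:

```python
mask = 15                 # 16 slots
i, seen = 1, []
while i not in seen:      # walk the recurrence until it cycles
    seen.append(i)
    i = (5 * i + 1) & mask

print(seen)               # all 16 slots visited before repeating
```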

The bot offered to show why multipliers of 3 and 7 don’t work (not congruent to 1 mod 4), but I told it “na, too detailed for what I’m after here”.

3 Likes

Note: no bot played any part in this reply. But I’m adopting its bullet-point style anyway, for clarity.

  • Probe strategy.
    • Linear probing is a pragmatic nightmare - far too easy to fall into quadratic-time cases.
    • CPython’s probe strategy was refined over years, and has no known “real life” very bad cases. Of course it’s possible to contrive bad sets of keys, but they’re exceedingly unlikely to show up in real life.
    • But it relies on having a power of 2 number of slots. Which also allows replacing expensive int mod with a 1-cycle bit mask.

So move to CPython’s probe strategy for each “level” (whatever that means, exactly). I believe it’s orthogonal to what you’re really after (multiple levels, not probe strategy).

  • Number of collisions is interesting and worth minimizing, but not primary. What matters much more is “pileup”: the longest collision chain encountered. Open addressing schemes that allow for high load factors are especially vulnerable to getting embarrassed by high pileup confined to a relatively small number of keys that just happen to be accessed especially frequently. CPython’s probe sequence is also resistant to high pileup. If, e.g., the average number of probes turns out to be 3, it barely matters at all if the longest collision chain is of length 5. But if the average number of probes turns out to be 2, it’s a potential disaster if any key has pileup in the thousands.

  • You need to say more at a higher level.

    • For example, do dicts have a maximum size under this scheme? Or can they grow to an unbounded number of “levels”? Massive dicts do show up in real life, and “just work” now (until you run out of RAM).
    • On the flip side, do dicts have a minimum size under this scheme? A large number of very small dicts is also common. If dicts “start life small” (as they do now), I’m not picturing how your scheme deals with that some of them may eventually grow to consume all of RAM :wink:
  • Something to note about the current scheme: on a resize, as mentioned before, the keys (pointers) are copied into a contiguous memory space. While that space shows up as “RAM consumed”, that’s pessimistic: it can very well be just an “address space reservation”, not consuming any actual HW resources unless and until the OS materializes a newly touched page of RAM. That is, the “empty space” after a doubling resize may be more a technical accounting quirk than something with actual consequences (although Windows does not allow mere address space reservations to exceed the amount of RAM + page file, even if the address space reserved is never actually used).

3 Likes

Contributions come from people with their name and their CLA signature and thus their responsibility attached. We have a stated policy up in Generative AI.

You cannot get a legal opinion here.

3 Likes

Because the bot said it’s the multiplying by 5 that guarantees the probe sequence will visit every slot before repeating. I just took it at its word.

2 Likes

Meh. I asked it not to get bogged down in technical details, and it complied. “Good enough” for purpose. Picking individual statements out of context is certain to expose “flaws”. Context you omitted:

  • It explicitly gave the full recurrence used, which does achieve full period.
  • It explicitly pointed out that the additive constant must be non-zero, else 0 would be a fixed point (a more succinct and cleaner proof of that i = 5*i can’t work than your exhaustive enumeration of what happens then).

But there’s a different reason for why 5 was picked, which the bot didn’t know. The multiplier must be congruent to 1 mod 4 to achieve full period. 1 isn’t suitable, because it would degenerate into linear probing then. The next smallest that could work is 5, and at the time the state of major HW and compilers was such that 5*i generally ran slower than (i << 2) + i. And the code was originally written the latter way for that reason.

So that’s almost the full story: Hull-Dobell plus “go fast” pragmatics. The remaining part is “why 1, when any non-zero additive constant would work?”. The answer is simple but boring: “why not 1? there’s no reason to imagine that any non-zero constant would work better than any other”.

But all of that elaboration is too deep in the weeds for what I was after.

2 Likes