PEP 603: Adding a frozenmap type to collections

Yury said he was going to repost the PEP shortly, so best to wait for that. It’ll probably restart at that point. I’m not sure hamt is a great name either - I’m sure I had some suggestions earlier, but I’d expect something like collections.SomeKindOfTable or collections.ImmutableSomething.

1 Like

Just skimmed the whole thread and can’t find a post with name suggestions -- they all seem to be some combination of “dict”/“map” with “frozen” and sometimes “copy”. Sorry if I just missed it; there is a LOT.

I’m happy with any name that’s more specific about the kind of algorithm involved than just “dict” or “map” – as you said above (IIRC), I think those will both lead to incorrect assumptions. hamt is just the best I could come up with that was unambiguous. Sadly, I think bikeshedding a great name is a key next step.

1 Like

(for context, you were talking about the name)
Maybe frozendict does, but FrozenDict would go well alongside OrderedDict.
I don’t think “hamt” is intelligible for most users, and if dict someday moves on to another implementation, I don’t think it would make sense for the frozen one not to follow; the same goes for another implementation of Python that chooses a different implementation for dict.
And if your point is that the frozen dict uses the same implementation as the native dict, putting “dict” in the name kinda implies that already.

Or we could have DictALaMode \s

Otherwise I don’t understand what wrong assumptions people would risk making, aside from hashing? For me “frozendict” or “FrozenDict” just yells “dict without the mutations”, exactly like the set/frozenset API does.


Going back a few:

I agree. The problematic case I mentioned earlier involved merging two dicts using the update method, since it’s faster than the a = b | c syntax, so the mutation happened by accident.
EDIT: yes, and I remember that the update came from an earlier version of the same function where the dict was built from **kwargs, so the update was safe. It became unsafe when I had to pass two dicts to the function without the ** syntax, and I didn’t change the rest of the function’s code.
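Something like this minimal sketch of the situation (the function and variable names are hypothetical):

    def build_config(**kwargs):
        kwargs.update(verbose=False)   # kwargs is a fresh dict here: mutating it is harmless
        return kwargs

    def build_config_v2(base, overrides):
        base.update(overrides)         # base is now the caller's own dict: mutated by accident
        return base                    # with a frozen mapping, this line would raise instead

    defaults = {"retries": 3}
    build_config_v2(defaults, {"verbose": True})
    print(defaults)                    # {'retries': 3, 'verbose': True} -- oops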
Anyway, so:

Well, the issue in that case is that I just used the dict API without worrying enough :sweat_smile:

I don’t know about everyone, but in that project we just don’t have any test suite, nor the time or manpower to build one. That’s the problem. Later you say “with sufficient tests…”, but the point is that we, and I suspect many other people, cannot afford for various reasons to have sufficient tests, or really any permanent test suite at all. Having an object, like tuple or frozenset, that raises an exception when you try to mutate it (as well as triggering type warnings in the IDE as a bonus, something no dict subclass will do) goes a long way for that kind of project.


Maybe piggybacking on the dict C implementation is easy at the scale of cpython and for people who know C, who can just reuse the same C function for each method the two types have in common.
Those advantages and trivialities become much harder to achieve when you only know how to write Python code, and I wrote at length about all the potholes you encounter when trying to make a working pure-Python frozen dict implementation (one of them is sketched below).
Something that can be trivially simple to implement in cpython - ish, it’s probably not that simple, but it’s still the simplest way to add a frozen dict to cpython - can be almost impossible to do outside of it, so the argument does not go both ways.
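One example of the kind of pothole meant here: the obvious pure-Python approach of subclassing dict and blocking __setitem__ is silently bypassed by the inherited C-level methods (the class name is just for illustration):

    class NaiveFrozenDict(dict):
        def __setitem__(self, key, value):
            raise TypeError("immutable")

    fd = NaiveFrozenDict(a=1)
    # Direct item assignment is caught by the override...
    # fd["b"] = 2  -> TypeError: immutable
    # ...but the inherited C implementation of update() never calls __setitem__:
    fd.update(b=2)
    print(fd)  # {'a': 1, 'b': 2}  -- mutated anyway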

1 Like

I think there are really only two contenders for the potential semantics of a frozen mapping type:

  • frozendict: underlying implementation shared with dict (including key sharing, insertion order preservation, etc.), exposes the Mapping API rather than MutableMapping, adds hashing support. Unlike MappingProxyType, there are no hidden backdoors that allow mutation (see the sketch after this list).
  • frozenmap/hamt: the proposal in PEP 603, exposing the data type that backs contextvars.Context, with a few additional niceties added to the public API as described in the PEP
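To illustrate the “hidden backdoor” mentioned in the first bullet, a minimal sketch (the variable names are just for illustration):

    from types import MappingProxyType

    data = {"a": 1}
    view = MappingProxyType(data)

    # The proxy itself rejects mutation...
    try:
        view["b"] = 2
    except TypeError:
        pass  # 'mappingproxy' object does not support item assignment

    # ...but whoever still holds the underlying dict can change what the proxy shows.
    data["b"] = 2
    print(view["b"])  # 2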

Adding one without the other is starting to feel like a “close but not quite” solution to me:

  • frozendict on its own has poor scaling properties when you do need to make changes to it (hence contextvars using a HAMT instead). While ChainMap can mitigate those costs (at the expense of making individual lookups more expensive), it’s still not hard to mess up and inadvertently make an algorithm quadratic or worse by adding in a dict copy operation (see the sketch after this list).
  • hamt on its own would inevitably be adopted for things that it’s bad at (e.g. publishing true constant values)
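A minimal sketch of the quadratic-copy trap and the ChainMap workaround mentioned in the first bullet (the function names are hypothetical):

    from collections import ChainMap

    def add_key_copying(mapping, key, value):
        # Copying the whole mapping on every "update" makes N insertions O(N**2) overall.
        new = dict(mapping)
        new[key] = value
        return new

    def add_key_chained(chain, key, value):
        # new_child() just prepends a one-entry dict: O(1) per "update",
        # at the price of lookups having to scan the accumulated layers.
        return chain.new_child({key: value})

    copied = {}
    chained = ChainMap()
    for i in range(1000):
        copied = add_key_copying(copied, i, i)    # quadratic total work
        chained = add_key_chained(chained, i, i)  # linear total work, but 1000 layers deep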

If both data structures were to be added, then frozentreemap might be a better name for the hamt container to make it clearer how it differs from frozendict (“it’s a hash tree rather than a hash table, so its performance characteristics are different from a regular dict” feels marginally easier to explain than having to dive into the weeds of exactly what “hamt” stands for, while offering both frozendict and frozenmap in the same library seems like a recipe for eternal confusion).

12 Likes

As a one-time C++ coder, I’d assume “treemap” meant an ordered red/black tree, rather than a hash-based structure.

My brain was trying to tell me there was a problem with the frozentreemap naming suggestion, and the potential ambiguity with a key ordered tree map is exactly the niggle that wasn’t coming fully to mind when I wrote the post.

frozenhamt may be the least bad option in that case (frozenhashtreemap might also be OK, albeit a bit verbose).

Regardless, while I originally thought that frozenmap was adequately distinct from frozendict, I’ve become significantly less convinced of that over the course of the discussion.

3 Likes

Still assuming that the second one (the not-frozendict) needs to be implemented, two thoughts.

First, since it would most likely not end up in builtins but rather somewhere in the stdlib (collections?), it should be in camel case rather than in all lowercase. I think OrderedDict sets a better example than defaultdict for that matter.

Second, like OrderedDict, I think the name should be chosen use-case-wise rather than implementation-wise. The PEP seems to indicate that copies are what’s really optimized there, so CopyableMap? Or something like that.
We don’t usually (or rather, we shouldn’t, in my opinion) impose algorithms that other Python implementations need to follow, so if some of them want to implement CopyableMap with a copy-oriented implementation other than a HAMT, they shouldn’t have to make the class name lie. Same if we later find a better copy-oriented algorithm that we decide is preferable.

Tbh I don’t really see a practical use I could make of that particular optimization, so I think the PEP should go in the direction of the frozendict (or collections.FrozenDict as a compromise) as the primary objective, and either leave hamt/CopyableMap to a twin/later PEP or introduce it as a second type in the same PEP.

7 Likes

Not imposing algorithmic constraints would be very surprising. For example, if creating a copy with a modified element is O(log n) in cpython but O(n) in pypy, code relying on that may technically still function, but will need to be rewritten anyway.

1 Like

The OrderedDict documentation does not give big-O constraints or guarantees. It only says that

  • The regular dict was designed to be very good at mapping operations. Tracking insertion order was secondary.
  • The OrderedDict was designed to be good at reordering operations. Space efficiency, iteration speed, and the performance of update operations were secondary.

I think writing something in the same vein about FrozenDict vs CopyableMap, and maybe mentioning the HAMT algorithm in an implementation-detail note along with its fantastic O(1) copying performance, would be fine by me. At most, recommend that the other implementations strive for similar complexity.

I don’t think it’s safe to require other implementations to reach any given big-O complexity, because I can’t know all the constraints of all possible implementations: in your example, maybe it’s impossible in a certain context to rewrite the code and do better than O(n), or not possible without tripling the memory requirement, or, or, or.

1 Like

I don’t see that it’s useful to users to have an unbounded range of possible implementations. We rely on algorithmic properties to write our code.

2 Likes

I think OrderedDict has proven useful to users while having some constraints placed on what uses it is optimized for, but no constraints (none documented at least, so none applying to the other implementations) concerning its complexity.
Obviously you rely on algorithmic properties (offered by the language you’re writing in) to write your code; that does not mean they need to be documented. And as I said, you can still signal them as useful implementation details.

But all of this is supposing that this HAMT deserves to be implemented in the stdlib (not to mention as a builtin). If y’all’s stance is that it absolutely has to be that particular algorithm, rather than just offering the feature of a mapping that’s copy-optimized, then I see even fewer reasons for implementing it - and all the more reasons to only add frozendict/collections.FrozenDict - since to my knowledge no type is documented as being implemented the same way across all implementations and platforms.

2 Likes

What does “copy-optimized” even mean if it implies no algorithmic bounds on implementation and could be O(n) in space and time?

It means that CopyableMap.copy() is faster or otherwise better than dict.copy() or frozendict.copy().
Which, if you read my dict vs OrderedDict analogy, is exactly what I understand the current documentation to imply about od.popitem(last=True) versus d.popitem(), or od.move_to_end(k, last=True) versus d[k] = d.pop(k).
So there is a relative algorithmic boundary, namely the performance/complexity of dict.copy(); but considering that dict.copy() has no documented bound in any implementation, there is no absolute big-O bound for any type. And there doesn’t have to be.
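For reference, the reordering comparison mentioned above, as a tiny sketch:

    from collections import OrderedDict

    d = dict(a=1, b=2, c=3)
    od = OrderedDict(d)

    od.move_to_end("a")    # what OrderedDict is documented as being good at
    d["a"] = d.pop("a")    # the plain-dict spelling: remove the key and re-insert it at the end

    assert list(d) == list(od) == ["b", "c", "a"]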

But I’ll repeat what I just said above: if you think that this admittedly light specification is not enough for CopyableMap to be implemented in a world where frozendict already exists, then I question the need for a CopyableMap/HAMT at all.

1 Like

While implementations of Python might theoretically be free to pick arbitrary algorithms for core datatypes, in practice cpython is constrained to not break the performance of existing code, and other implementations are generally trying to perform at least as fast as cpython so are similarly constrained. So all that is really gained by not documenting complexity is a lack of documentation.

Whatever algorithm is picked, it will need to be hash-based, because there is no way to rely on key ordering. Requiring less-than-linear space for copies without losing lookup performance forces a tree. I don’t think giving this thing a name involving “hash” and “tree” is unreasonable. It doesn’t have to be as specific as “hamt”: I already said we could do with better suggestions. Maybe FrozenHashTree. But “copy map” doesn’t scan and doesn’t really mean anything to me.

I don’t think refusing to add something to core Python because it requires thinking about algorithms is a very batteries-included approach. I don’t think it really reflects on the true nature of what is already in core Python either.

2 Likes

I’d clarify that my problem is not that from a user’s point of view “it requires thinking about algorithms” - after all, the functions in the hashlib module are named after algorithms - but rather that it constrains the maintainer to keep that algorithm even when a better one is found afterwards.

As far as I know, it’s computationally impossible to prove that a certain algorithm is the best at its job, so I would not rely on that.


But I think I’ll stop there about the HAMT/CopyableMap, as it’s not really what I care much about seeing implemented, and I’ll focus on the need for the frozendict / collections.FrozenDict.
My stance is that the PEP should focus on frozendict (same as dict without the mutability), and that HAMT should either not be added to the stdlib, be left for a later PEP, or be described in the same PEP as the frozendict.

1 Like

@yselivanov Is your PEP resubmission going to stick with the HAMT version or is there a possibility you go with the frozendict version?

HAMT is already part of the stdlib (it backs the contextvars implementation, as mentioned in the Rationale section of PEP 603). It’s just not exposed in a way that allows users with similar use cases to readily access that implementation; they have to grab one from PyPI instead.

While it was PEP 567 that added contextvars, the related HAMT performance analysis was in the withdrawn execution context proposal in PEP 550 (PEP 567 just referred back to the PEP 550 performance measurements to explain why HAMT was being used as the underlying data structure).

For truly immutable mappings, collections.frozendict is going to be a better answer, since its only downside is the high cost of making changes: you have to copy the entire dictionary even if you’re only changing one key.

The niche that HAMT slots into is one where (see the sketch after this list):

  • the cost of chaining lookups over multiple mappings is too high to be considered negligible
  • the expected frequency of mutation is too high to ignore the linear cost of making copies
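To make that niche concrete, here is a rough sketch using the immutables package from PyPI (one of the HAMT implementations people currently grab, per the note above; the exact method names proposed for frozenmap in PEP 603 may differ):

    import immutables

    base = immutables.Map({"host": "localhost", "port": 8080})

    # "Mutation" produces a new map; the original is untouched, and the two
    # share most of their internal structure, so this is O(log n) rather than
    # the O(n) full copy a dict-based frozendict would need.
    patched = base.set("port", 9090)

    print(base["port"])     # 8080
    print(patched["port"])  # 9090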

There isn’t a handy English word that summarises that niche the way that OrderedDict, defaultdict, or frozendict summarise the differences between themselves and a regular dict. (Edit: after posting this, one possibility that did occur to me is SnapshotMap, since the point is that mutation is handled as a series of independent snapshots rather than via in-place mutation)

Naming it after the algorithmic performance runs into the problem that, outside specific mathematical contexts, “log” in programming is more likely to mean logging than logarithm.

Hence the suggestion of just naming it after the initial implementation technique, with a name like collections.frozenhamt or collections.FrozenHAMT. If someone manages to invent a different data structure that is able to present the interface and logarithmic scaling properties of a hash-array-mapped-trie while somehow being more efficient under the hood (e.g. lower constant scaling factors for memory usage or operation durations) that would be fine, since the name is a way of summarising the interface and expected algorithmic performance characteristics, rather than directly specifying the required implementation.

Tangential thought prompted by the above notes: PEP 603 should include making ChainMap hashable if all of the included mappings are hashable.

6 Likes

Not sure this can be done efficiently? You can’t just hash the underlying maps as keys might be overridden in other maps, so you’d need to compute the final content of the map by iterating every one. But I think you’d still need to hash each underlying map anyway just to check they were immutable? So this would be very expensive.

3 Likes

Is there a Python standard definition of how to calculate the hash of a frozen mapping of any kind? If not, would this PEP need to set that definition?

For context (it was not obvious to me), the reason why that wouldn’t violate the __hash__ v __eq__ rule:

If a class defines mutable objects and implements an __eq__() method, it should not implement __hash__(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).

is that if all the included mappings are hashable (and the first one in particular), they are immutable, which means that mutation operations will fail on that ChainMap.

But one nitpick (in addition to what Alice pointed out) is that ChainMap’s API is optimized for mutable mappings: the new_child method creates a new mutable mapping by default and would instantly break the hashability, as shown below. And I’m not sure the errors are pretty when a mutation fails, though that’s easily fixed.
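A quick illustration of the new_child point (pretending the underlying mappings were otherwise hashable):

    from collections import ChainMap

    cm = ChainMap({"a": 1})
    child = cm.new_child()      # with no argument, a plain (mutable) dict is prepended
    print(type(child.maps[0]))  # <class 'dict'> -- the new front layer is mutable again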


PEP 416 mentions hash(frozenset(d.items())). That’s (aligned with) how dict equality works, so I even independently wrote the same snippet some time ago before reading it. It’s probably as standard as you’re going to get.
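As a rough sketch of what that definition looks like in practice (the class is hypothetical, with just enough of the Mapping API to be usable):

    from collections.abc import Mapping

    class FrozenDictSketch(Mapping):
        def __init__(self, *args, **kwargs):
            self._data = dict(*args, **kwargs)

        def __getitem__(self, key):
            return self._data[key]

        def __iter__(self):
            return iter(self._data)

        def __len__(self):
            return len(self._data)

        def __hash__(self):
            # The PEP 416 suggestion: order-independent, and consistent with
            # Mapping equality (two equal mappings hash the same).
            return hash(frozenset(self._data.items()))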


I also noticed something: PEP 416, which hasn’t been mentioned here for a long time. If the resubmitted PEP just takes on frozendict and leaves the HAMT behind, then it’s just a reopening of PEP 416 with slightly new arguments - and a new perspective on the persisting need for a frozendict after MappingProxyType’s integration.
I wrote a draft for that, with just the edits required to bring 416 up to date, but I don’t want to post it too early and steal Yury’s initiative.
That only holds if we split HAMT off into another PEP and keep only frozendict; if we put the two in one PEP (with maybe that ChainMap thing as a bonus), the situation is not the same.

2 Likes