PEP 603: Adding a frozenmap type to collections

Yury said he was going to repost the PEP shortly, so best to wait for that. It’ll probably restart at that point. I’m not sure hamt is a great name either - I’m sure I had some suggestions earlier, but I’d expect something like collections.SomeKindOfTable or collections.ImmutableSomething.

1 Like

Just skimmed the whole thread and can’t find a post with name suggestions - - they all seem to be some combination of “dict”/“map” with “frozen” and sometimes “copy”. Sorry if I just missed it, there is a LOT.

I’m happy with any name that’s more specific about the kind of algorithm involved than just “dict” or “map” – as you said above (IIRC), I think those will both lead to incorrect assumptions. hamt is just the best I could come up with that was unambiguous. Sadly, I think bikeshedding a great name is a key next step.

1 Like

(for context, you were talking about the name)
Maybe frozendict does, but FrozenDict would go well alongside OrderedDict.
I don’t think “hamt” is intelligible for most users, and if dict someday moves on to another implementation, I don’t think it would make sense for the frozen one not to ; same for another implementation of Python that would choose another implementation for dict.
And if your point is that the frozen dict uses the same implem as the native dict, putting “dict” in the name kinda implies that.

Or we could have DictALaMode \s

Otherwise I don’t understand what wrong assumptions people would risk making, aside from hashing ? For me “frozendict” or “FrozenDict” just yells “dict without the mutations”, exactly like the set/frozenset API does.


Going back a few :

I agree. The problematic case I mentioned earlier involved merging two dicts and using the update method since it’s faster than the a = b | c syntax, so the mutating happened by accident.
EDIT: yes, and I remember that update came from an earlier version of the same function where the dict was a **kwargs, so the update was safe. And it became not safe when I had to pass two dicts to the function, without the ** syntax, and I didn’t change the rest of the function’s code.
Anyway, so :

Well the issue in that case is that I just used the dict API witout worrying enough :sweat_smile:

I don’t know about everyone, but in that project we just don’t have any test suite, nor the time or manpower to have one. That’s the problem. Later you say “with sufficient tests…”, but the point is that we, and I suspect many people cannot afford for various reason to have sufficient, or really any permanent test suite at all. Having an object, like tuple, like frozenset, that raises an exception when you try to mutate it (as well as triggering type warnings in the IDE as a bonus, something no dict subclass will) goes a long way for that kind of projects.


Maybe piggybacking the dict C implem is easy at the scale of cpython and for people who know of C, and who can just use the same C function for each method the two types have in common.
Advantages and trivialities which become way harder to overcome when you only know how to write Python code, and I wrote at length all the potholes you encounter when trying to make a working pure-Python frozen dict implem.
Something that can be trivially simple to implement in cpython - ish, it’s probably not that simple, but it’s the simpler way to add a frozen dict in cpython - can be almost impossible to do outside of it, the argument does not go both ways.

I think there are really only two contenders for the potential semantics of a frozen mapping type:

  • frozendict: underlying implementation shared with dict (including key sharing, insertion order preservation, etc), exposes Mapping API rather than MutableMapping, adds hashing support. Unlike MappingProxyType, no hidden backdoors that allow mutation.
  • frozenmap/hamt: the proposal in PEP 603, exposing the data type that backs contextvars.Context, with a few additional niceties added to the public API as described in the PEP

Adding one without the other is starting to feel like a “close but not quite” solution to me:

  • frozendict on its own has poor scaling properties when you do need to make changes to it (hence contextvars using HAMT instead). While ChainMap can mitigate those costs (at the expense of making individual lookups more expensive), it’s still not hard to mess up and invadvertently make an algorithm quadratic or worse by adding in a dict copy operation.
  • hamt on its own would inevitably be adopted for things that it’s bad at (e.g. publishing true constant values)

If both data structures were to be added, then frozentreemap might be a better name for the hamt container to make it clearer how it differs from frozendict (“it’s a hash tree rather than a hash table, so its performance characteristics are different from a regular dict” feels marginally easier to explain than having to dive into the weeds of exactly what “hamt” stands for, while offering both frozendict and frozenmap in the same library seems like a recipe for eternal confusion).

9 Likes

As a one-time C++ coder, I’d assume “treemap” meant an ordered red/black tree, rather than a hash-based structure.

My brain was trying to tell me there was a problem with the frozentreemap naming suggestion, and the potential ambiguity with a key ordered tree map is exactly the niggle that wasn’t coming fully to mind when I wrote the post.

frozenhamt may be the least bad option in that case (frozenhashtreemap might also be OK, albeit a bit verbose).

Regardless, while I originally thought that frozenmap was adequately distinct from frozendict, I’ve become significantly less convinced of that over the course of the discussion.

2 Likes

Still in the idea that the second one (the not-frozendict) would need to be implemented, two thoughts.

First, since it would most likely not end up in builtins but rather somewhere in the stdlib (collections ?), it should be in camel case rather than in all lowercase. I think OrderedDict sets a better example than defaultdict for that matter.

Second, like OrderedDict, I think the name should be chosen usecase-wise rather than implementation-wise. The PEP seems to indicate that copies are what’s really optimized there, so CopyableMap ? or something like that.
We don’t usually (or rather, we shouldn’t in my opinion) impose algorithms for the implementation that other Python implem need to follow, so if some of them want to implement CopyableMap with another copy-oriented implem than hamt, they shouldn’t have to make the class name lie. Same if we later find a better copy-oriented algorithm that we decide is preferable.

Tbh I don’t really see a practical use I could make of that particular optimization, so I think the PEP should go in the direction of the frozendict (or collections.FrozenDict as a compromise) as the primary objective, and leave hamt/CopyableMap either in a twin/later PEP, or as a second type introduced in the same PEP.

6 Likes

Not imposing algorithmic constraints would be very surprising. e.g. if creating a copy with a modified element is O(log n) in cpython but O(n) in pypy, that code may technically still function, but will need to be rewritten anyway.

The OrderedDict documentation does not give big-O constraints or guarantees. It only says that

  • The regular dict was designed to be very good at mapping operations. Tracking insertion order was secondary.
  • The OrderedDict was designed to be good at reordering operations. Space efficiency, iteration speed, and the performance of update operations were secondary.

I think writing something in the same vein about FrozenDict v CopyableMap and maybe mentioning the HAMT algorithm in an implementation detail notice, along with its fantastic O(1) copying performance, that would be fine by me. At best, recommend the other implems to strive for similar complexity.

I don’t think it’s safe to require of other implems to reach any big-O complexity, because I can’t know all the constraints of all possible implementations : in your example, maybe it’s impossible in a certain context to rewrite the code and have better than O(n), or not possible without tripling the memory requirement, or, or, or.

I don’t see that’s it’s useful to users to have an unbounded possibility of implementations. We rely on algorithmic properties to write our code.

I think OrderedDict has proven useful to users while having some constraints put on what uses it is optimized for, but no constraints (none documented at least, so none applying to the other implems) concerning its complexity.
Obviously you rely on algorithmic properties (offered by the language you’re writing in) to write your code, that does not mean they need to be documented. And as I said, you can still signal them as useful implementation details.

But all of this is supposing that this HAMT deserves to be implemented in the stdlib (not to mention as a builtin). If y’all stance is that it absolutely has to be that particular algorithm, rather than just offering the feature of a mapping that’s copy-optimized, then I see even less reasons for implementing it - and all the more reasons to only add frozendict/collections.FrozenDict - as to my knowledge no type is documented as implemented the same way across all implementations and platforms.

What does “copy-optimized” even mean if it implies no algorithmic bounds on implementation and could be O(n) in space and time?

It means that CopyableMap.copy() is faster or otherwise better than dict.copy() or frozendict.copy().
Which, if you read my dict v OrderedDict analogy, is exactly what I understand the current implementation to specify about od.popitem(last=True) versus d.popitem(), or od.move_to_end(k, last=True) versus d[k] = d.pop(k).
So there is a relative algorithmic boundary which is the performance/complexity of dict.copy(), though considering that it has no documented boundary across any implementation, it means there is no big-O absolute boundary for any type. And there doesn’t have to be.

But I’ll repeat what I just said above, if you think that this admittedly light specification is not enough for CopyableMap to be implemented in a world where frozendict already exists, then I question the need for a CopyableMap/HAMT at all.

While implementations of Python might theoretically be free to pick arbitrary algorithms for core datatypes, in practice cpython is constrained to not break the performance of existing code, and other implementations are generally trying to perform at least as fast as cpython so are similarly constrained. So all that is really gained by not documenting complexity is a lack of documentation.

Whatever algorithm is picked, it will need to be hash-based, because there is no way to rely on key ordering. Having less than linear space constraint on copy without losing lookup performance forces a tree. I don’t think calling this thing a name involving “hash” and “tree” is unreasonable. It doesn’t have to be as specific as “hamt”: I already said we could do with better suggestions. Maybe FrozenHashTree. But “copy map” doesn’t scan and doesn’t really mean anything to me.

I don’t think refusing to add something to core Python because it requires thinking about algorithms is a very batteries-included approach. I don’t think it really reflects on the true nature of what is already in core Python either.

I’d clarify that my problem is not that from a user’s point of view “it requires thinking about algorithms” - after all, the functions in the hashlib module are named after algorithms - but rather that it constrains the maintainer to keep that algorithm even when a better one is found afterwards.

As far as I know, it’s computationally impossible to prove that a certain algorithm is the best at its job, so I would not rely on that.


But I think I’ll stop there about the HAMT/CopyableMap, as it’s not really what I care much about seeing implemented, and I’ll focus on the need for the frozendict / collections.FrozenDict.
My stance is that the PEP should focus on frozendict (same as dict without the mutability), and that HAMT should be not added to the stdlib, or left for a later PEP, or described in the same PEP as the frozendict.

1 Like

@yselivanov Is your PEP resubmission going to stick with the HAMT version or is there a possibility you go with the frozendict version?