PEP 603: Adding a frozenmap type to collections

I think the hold-up is Yury having time to continue pushing the PEP forward.

Given that it doesn’t appear to have been actively worked on for several years, any objection to at least marking the PEP as deferred (unless @yselivanov wants to formally withdraw or update it)? Right now it still lists Python 3.9 as the target version.

I would drop him a note by email.

I’m OK to mark it “deferred”, although I’m not sure what that will actually change.

AFAIK @pablogsal wanted to pick up the PEP and get it over the finish line. Pablo, wanna join forces?

3 Likes

It just makes clear the PEP isn’t currently under active discussion, though it may be picked up at some point in the future. But if Pablo might want to pick it up, I’ll just leave it as it is for the moment (except dropping the outdated Python version in python/peps#2756).

1 Like

Just to add to this great point, I remember seeing a frozendict in some data science libraries here: flax/flax/core/frozen_dict.py at e7742de22ba7c57308e0995a5793d014c741cffa · google/flax · GitHub

Would be great to see uses like this supplanted by something in the standard library.

All great points. Another reason to have the frozenmap implementation on PyPI is so that people can start using it before they upgrade to the Python version that exposes it.

Just want to point out that there are some frozendict/frozenmap implementations in various GitHub projects.

The frozendict in Flax that I linked above does expose and use hashability (of the keys and values in the mapping).
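
To illustrate what hashability buys you in practice: with plain dicts, people have to freeze the items by hand before they can use a mapping as a cache or dict key. (The helper below is just a made-up illustration, not anything from Flax or the PEP.)

```python
from functools import lru_cache

def as_key(mapping):
    # Workaround used today: a frozenset of the items is hashable,
    # a plain dict is not. A hashable frozenmap would make this
    # indirection unnecessary.
    return frozenset(mapping.items())

@lru_cache(maxsize=None)
def run_experiment(config_key):
    # Expensive work keyed on an immutable view of the config.
    return sum(value for _, value in config_key)

config = {"layers": 3, "width": 128}
print(run_experiment(as_key(config)))  # computed
print(run_experiment(as_key(config)))  # cache hit
```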

3 Likes

I think everyone actually uses this: GitHub - MagicStack/immutables: A high-performance immutable mapping type for Python.
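
For anyone who hasn’t used it, the API is roughly this (a minimal sketch, assuming `pip install immutables`):

```python
from immutables import Map

m1 = Map(a=1, b=2)
m2 = m1.set("c", 3)      # returns a new Map; m1 is unchanged
m3 = m2.delete("a")      # likewise returns a new Map

print(m2["c"], len(m3))  # 3 2

# Batch changes go through a mutation object, so you don't build
# an intermediate Map for every single update.
with m2.mutate() as mm:
    mm["d"] = 4
    del mm["b"]
    m4 = mm.finish()

print(sorted(m4.items()))  # [('a', 1), ('c', 3), ('d', 4)]
```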

4 Likes

Given that we already have the needed HAMT implementation in Python as part of the Context/ContextVar logic, exposing it as a frozenmap in collections sounds like a great idea.

I guess, the only holdup is finding someone to actually do the necessary work and create a PR.

11 Likes

Well, doesn’t the PEP also need to be submitted to the SC and approved (and any necessary updates made beforehand)?

Of course, but having an implementation to review simplifies the process somewhat: it lets people play with the functionality and hash out quirks, and it also demonstrates just how much extra maintenance such a module would require.

The PEP may need a co-author, though, in case @yselivanov no longer has cycles available to push it forward.

Such an implementation could also provide the basis for optimizations to the match statement (e.g. in case you only match on constant hashable values) and would help somewhat in optimizing lookup tables (which normally work on static mappings).
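
To make the match-statement angle concrete, here’s the kind of pattern I mean (names made up); today people hand-optimize the first form into the second, and a frozen mapping would let such tables be built once and shared safely:

```python
def handle_load(arg): return ("load", arg)
def handle_store(arg): return ("store", arg)

# Matching on constant, hashable values -- in principle a compiler
# could turn this into a single hash lookup.
def dispatch_match(opcode, arg):
    match opcode:
        case "LOAD":
            return handle_load(arg)
        case "STORE":
            return handle_store(arg)
        case _:
            raise ValueError(opcode)

# The hand-written equivalent: a static lookup table, i.e. exactly
# the "static mapping" use case mentioned above.
_DISPATCH = {"LOAD": handle_load, "STORE": handle_store}

def dispatch_table(opcode, arg):
    return _DISPATCH[opcode](arg)
```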

6 Likes

It came out in the discussion last time (perhaps in this thread? I haven’t reread it) that these don’t actually get any improvement.

The HAMT implementation we have sacrifices lookup time in favour of memory usage. IIRC, it’s constant time copy and logarithmic time lookup (though the base is set high enough that it’s basically constant for a couple hundred elements).

So if you’re planning on having hundreds of slight variants of each table floating around, they can share the common parts in memory (assuming you can build most of it in one go and then share that object reference in order to produce the variants). But it’s always algorithmically slower lookup than our existing dict.
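
A rough way to see both sides of that tradeoff yourself (just a sketch, assuming the immutables package is installed):

```python
import timeit
from immutables import Map

d = {i: i for i in range(100_000)}
m = Map(d)

# "Copy with one change": the dict has to copy everything,
# the HAMT shares almost all of its structure with the original.
print(timeit.timeit(lambda: {**d, -1: -1}, number=100))
print(timeit.timeit(lambda: m.set(-1, -1), number=100))

# Single-key lookup: the plain dict wins.
print(timeit.timeit(lambda: d[99_999], number=1_000_000))
print(timeit.timeit(lambda: m[99_999], number=1_000_000))
```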

The fact that people keep thinking they’ll get speed optimizations out of HAMT is why I think we shouldn’t expose it as a public API. Apparently, it’s not obvious what the benefits of this data structure are, and so it’s very likely to distract more than it benefits anyone.

1 Like

Good point. So it’s optimized for fast copying (which makes sense for the ContextVar use case) and would only provide benefits for the match statement optimization with just a few options, not for lookup tables with thousands of entries.

I guess what people would expect from a frozenmap are the following requirements:

  • O(1) (amortized) lookup time
  • low memory overhead
  • no bias towards a particular number of keys
  • O(n) time to build the mapping

Our dict implementation is probably the most optimized data structure we have in Python, so it would be ideal for the above, except that building the dict can take more than O(n) time due to having to rehash the table every now and then while building it.

What’s missing is a good way to build a dict with a known number of entries that doesn’t need resizing during the insert phase, i.e. a constructor hint which allows passing in a size value. The dict implementation could then start with an empty dict, run a C dictresize() with the hint value and proceed to fill the dict with the given (constant) set of key-value pairs.
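
Purely hypothetical sketch of what I mean (both the size_hint parameter and the fromitems name below are made up); the point is only to show where the resizes currently happen:

```python
pairs = [(i, str(i)) for i in range(10_000)]   # known, fixed set of items

# Today: the table starts small and gets rehashed/resized several times
# while it fills up, even though len(pairs) is known in advance.
table = {}
for key, value in pairs:
    table[key] = value

# Hypothetical constructor hint (not a real API):
#     table = dict.fromitems(pairs, size_hint=len(pairs))
# so the implementation could run dictresize() once up front and then
# insert all items without any further rehashing.
```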

The dict implementation has some optimizations in the dict.fromkeys() constructor, but unfortunately only when using dicts or sets as the basis, and fromkeys(), of course, drops the value information. There is an internal C API _PyDict_FromItems(), though, which does exactly what is needed.

In order to use the dict implementation for a frozenmap, it would also need a flag to prevent modifications after the initial creation.
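
Not the C-level flag itself, but a rough pure-Python approximation of the behaviour I have in mind (C code could still bypass this; it’s just to illustrate the semantics):

```python
class frozendict(dict):
    def _readonly(self, *args, **kwargs):
        raise TypeError("frozendict is read-only")

    # Block all mutating methods after construction.
    __setitem__ = __delitem__ = _readonly
    clear = pop = popitem = setdefault = update = _readonly

fd = frozendict({"a": 1, "b": 2})
print(fd["a"])        # lookups work as usual
try:
    fd["c"] = 3
except TypeError as exc:
    print(exc)        # frozendict is read-only
```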

Back to the topic of maintenance, reusing the dict implementation would have the same effect as reusing the HAMT implementation, so this would be a true alternative. In addition, you’d get all the benefits of having a regular dict.

1 Like

(post deleted by author)

They’re close:

Also, from experience (twice), industry often won’t choose the former due to licensing.

Thanks, but I don’t think duplicating the dict implementation with additional tweaks is a good approach. The maintenance cost is too high and such a copy would eventually get out of sync with the main dict implementation.

AFAICT, we’d just need two things:

  • expose _PyDict_FromItems() as a new constructor for dicts
  • add a flag to make them readonly

Both would also be useful for regular dicts, e.g. you may want to make a dict readonly after a setup phase, and it’s also not uncommon to create a dict from a larger list of items when initializing it (as opposed to using an iterator to feed in new data, which doesn’t expose any size information).
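
For the readonly-after-setup part there is already a partial answer in the stdlib: types.MappingProxyType gives a read-only view of a dict (the underlying dict can still be mutated by whoever keeps a reference to it, so it’s a view, not a flag):

```python
from types import MappingProxyType

settings = {}
settings["debug"] = False        # setup phase
settings["workers"] = 4

public_settings = MappingProxyType(settings)
print(public_settings["workers"])     # reads work
try:
    public_settings["workers"] = 8    # writes don't
except TypeError as exc:
    print(exc)
```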

A new type is not really necessary, unless there’s a need to do type checking on such frozendicts.

Note that _PyDict_FromItems() and _PyDict_NewPresized() may waste memory when there are many duplicated keys.

For a mutable dict, the extra space would get used by future insertions.
But for an immutable dict, the extra space just wastes memory.

So I don’t think _PyDict_FromItems() is important for frozendict.
This micro-optimization is good for keyword arguments, but not so good for a general data structure.

True, but it’s rather uncommon to have duplicates in tables of data you are using to build static lookup dicts.

As for the memory overhead: it may be possible to reduce this by adjusting how the log2_newsize is estimated in dict_new_presized() for frozendicts.

Something to investigate: how a low memory overhead affects lookup times. A smaller hash table may well increase the number of collisions you see in lookups. It’s a memory-speed tradeoff.

The main reason to use _PyDict_FromItems() for this is to avoid resizing the dict while inserting the data (since that’s an expensive operation). A single resize should suffice when you know the total number of key-value pairs you want to have in a dict upfront.

I may be wrong, but from looking at the code, it seems that resizes only happen when adding new data, or when popping items and causing a switch from split tables to combined tables. Other deletions don’t cause the dict to get resized. A dict method ._resize() to force a resize based on the current fill state may be useful for dicts which have fluctuating fill states, in order to reduce memory load.
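
The closest thing to a manual resize today is simply copying the dict: deletions don’t shrink the table, but a fresh copy is sized for the items that are actually left.

```python
import sys

d = {i: i for i in range(100_000)}
for i in range(99_000):
    del d[i]

print(len(d), sys.getsizeof(d))        # 1000 keys, still a large table
print(len(d), sys.getsizeof(dict(d)))  # 1000 keys, much smaller table
```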

Just checked: when placing static dict definitions into Python code, the compiler generates a BUILD_CONST_KEY_MAP bytecode, which then calls _PyDict_FromItems(), so we already have an indirect way of using the API in Python.
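
Easy to see with dis (output abbreviated; the exact bytecode differs between CPython versions):

```python
import dis

dis.dis("{'LOAD': a, 'STORE': b, 'ADD': c}")
#   ...
#   LOAD_CONST               0 (('LOAD', 'STORE', 'ADD'))
#   BUILD_CONST_KEY_MAP      3
#   ...
```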

Why would you want to hash a frozendict?

I guess that’s a matter of perspective :slight_smile: Given that a dict with the readonly flag set still provides the same API as one without the flag (but raises exceptions from the mutating APIs), a subtype would certainly be a possibility. Probably not the best one, agreed. Like I said: a new type is not really necessary.

The extra maintenance overhead would only pay off if there were a potential upside to using such readonly dicts in the C interpreter to speed up lookups and reduce memory usage. I don’t know what the faster-cpython team is thinking about this, though. They appear to be moving in a different direction, from what I gather.