PEP 603: Adding a frozenmap type to collections

Paddy3118 · September 14, 2019, 6:57am

Formatting broken?
Finally off my phone and looking at an example in two browsers it seems the formating was lost somewhere. I think the formatting on one of your examples above should be:

numbers = frozenmap((i, i ** 2) for i in range(1_000_000))

with numbers.mutating() as copy:
    for i in numbers:
        if not (numbers[i] % 997):
            del copy[i]

    numbers_without_997_multiples = frozenmap(copy)

    # at this point, *numbers* still has 1_000_000 key/values, and
    # *numbers_without_997_multiples* is a copy of *numbers* without
    # values that are multiples of 997.

    for i in numbers:
       if not (numbers[i] % 593):
            del copy[i]
    
    numbers_without_593_multiples = frozenmap(copy)
    
    print(copy[10])  # will print 100.

print(copy[10]) # This will throw a ValueError as copy
# has been closed when the “with” block
# was executed.

unreliable-robot · September 14, 2019, 7:00am

The value of this PEP can’t be really understated … can I ask whether this could be extended to becoming a standard method of “casting” such as might be seen in changing int type to a string type for all classes? Eg

frozen_somenumber = frozen (some number)

somefunction(frozen (variable))

I acknowledge it broadens the scope of the PEP but a common “frozen” progressively implemented keyword may be perceived as simpler in the long term.

Paddy3118 · September 14, 2019, 7:17am

OK, now I’m off my phone and on my laptop, I can express myself more clearly. In the with numbers.mutating() as copy: example, (which I have hopefully mentioned a fix for the formatting earlier), You can hack and add to copy as needed then applying frozenmap(copy) creates the new frozen map with efficiencies for data common to both numbers and numbers_without_593_multiples.

I was wondering if numbers_without_593_multiples could be created from numbers using a comprehension expression rather than the with statement and with the same efficiencies?
(`cos I like comprehensions )

steven.daprano · September 14, 2019, 9:35am

You know Oscar, that seems horribly like you are complaining that the

code is too fast and efficient. "Why would I want a fast constant time

operation when I could have a slow O(N) operation???"

wink

I believe that the idea is that the second mapping is a modification of

the first. We frequently do that with other immutable collections:

new_tuple = old_tuple[:5] + (2, 3, 4) + my_tuple(1:8)

and especially strings. I don’t think I like this “mutating” method, for

reasons I shall explain in another post, but the idea of copy-with-

variation seems very useful.

steven.daprano · September 14, 2019, 10:29am

All numeric types in the standard library are already frozen
(immutable), so that would be a no-op.

See PEP 351.

steven.daprano · September 14, 2019, 10:30am

Immutable dicts have been requested and implemented many times.
PEP 416
describes some of the many third-party implementations.

Here is one which doesn’t seem to be listed in the PEP (perhaps
it is too new?):

Another implementation not found in the PEP is a port of immutable
datastructures from Clojure, Funktown.

Discussions on Stackoverflow:

Another discussion on python-list
(note that I no longer hold the opinions expressed in that thread).

A very old PEP proposed a freeze protocol
which would allow freezing dicts.

A couple of articles justifying why immutable dicts would be useful:

https://www.cs.toronto.edu/~tijmen/programming/immutableDictionaries.html

I’m not sure whether the specific implementation and API of this
frozenmap is a good one, but some kind of immutable, hashable mapping
seems to be commonly requested and commonly re-invented.

steven.daprano · September 14, 2019, 11:15am

Just a few observations on the PEP.

That’s not what dict accepts.

I think it might be simpler to simply say that frozenmap duplicates the

dict constructor API.

Why the limitation to a single key and value at a time?

Given that dicts have an “update” method, perhaps “updated” would be a

better name for the method, with a similar API:

frozenmap.updated([obj,] \*\*kwargs)

where obj (if present) is either a mapping with a keys() method, or

an iterable of (key, value) pairs.

Likewise for the “excluding” method – why on a single key at a time?

Is it intentional that the union method accepts None as the first

positional argument? I think that should be an error.

oscarbenjamin · September 14, 2019, 11:21am

Immutable dicts have been requested many times and it’s not hard to see situations where they could be used. The PEP emphasises though that this is not an immutable “dict” because it differs in some ways. I haven’t seen a request for something with these different characteristics. Also the PEP defines new methods/types that don’t exist for dict: I haven’t seen requests for those either.

An immutable dict on its own would seem a simpler proposal than the PEP. The PEP might be better but I would say that I haven’t seen a user request for the proposed functionality as an alternative to an immutable dict.

Oscar

steve.dower · September 14, 2019, 12:31pm

Yury may respond for himself, but I did have this discussion with him and the primary issue is that dict now retains insertion order and this one does not (at least not without sacrificing its performance benefits). So we can’t provide the equivalent of the current dict, though the 3.6 and earlier one would have been fine (note that 3.6 did preserve insertion order, but it was not guaranteed as part of the dict specification).

Otherwise, despite lookup technically not being O(1) (it’s logN, where the base is 32 which makes it close enough to O(1) for the first few billion or so elements), it probably could act as an equivalent for dict. I too wish it could be, but that ship sailed when we guaranteed insertion order.

jimbaker · September 14, 2019, 6:35pm

(I’m a new user on discuss… consequently I only get two links… sorry, you will have to go search for them!)

Please add a question/answer pair for “How difficult will this be for other implementations (PyPy, CircuitPython, MicroPython, IronPython, Jython)?”

Jython would use the existing and substantial persistent collection work in the JVM space. In particular, kotlinx.collections.immutable would be a good starting point. Interestingly, it also supports insertion ordering! No need for frozenmap, we can have frozendict in Python. Read on if you are interested.

Per CHAMP algorithm is current state of the art for immutable/persistent collections · Issue #6 · Kotlin/kotlinx.collections.immutable · GitHub

@ilya-g, yes CHAMP can be used to implement hash-sets or hash-maps that preserve element insertion order. I implemented such a data structure last year for another project and now started to back port that data structure to the github repo usethesource/capsule (removed link - jimbaker). You can expect the implementation to appear in the codebase by September [2016].

It was in fact in fact implemented; the implementation for this code, in Kotlin, can be found in the github repo for kotlinx.collections.immutable, specifically persistentOrderedMap

The algorithm is documented in papers/presentations that are linked from the github repo usethesource/capsule, especially his PhD thesis. It is possible that this algorithm is too tightly tied to the JVM, eg code generation, but worthwhile to explore further. (I think the codegen is more about layout/access, we do similar things in Jython for Python arrays, but I have not yet read closely this dissertation or the Kotlin code.)

It would be cool if CPython would port code from the JVM space. Return the favor? Regardless, very much doable.

rhettinger · September 15, 2019, 9:56pm

Yes, freezing a dict has come up many times before. It is such an obvious thing to think about.

This PEP-603 is somewhat different and unprecedented. It proposes a complete new mapping type base on HAMT and that is optimized for fast copies at the expense of lookups. It is profoundly different. As far as I can tell, this function has never been requested (let alone imagined) in Python’s history.

If all we want is to freeze a dict, there are a lot of simple ways to do that. But the PEP is about something different – it is algorithm focused. Usually, we don’t do that. We offer core APIs when thery are essential and talk very little about the implementation choice. We don’t have module littered with every interesting data structure we can think of (treaps, skiplists, pairing heaps, HAMT, etc. That is left for PyPI.

I would really like to see multiple cases of real-world code that would be substantially improved by having this in the standard library.

aeros · September 16, 2019, 3:42am

Is there any particular reason as to why this hasn’t already been implemented? Regardless of whether or not this PEP is accepted, the demand for a standard frozendict seems to be rather unanimous.

This has been on PyPI as “immutables” since December 2018, so perhaps some download and usage statistics would provide some useful insight. Testimonies from users of the package would also be helpful in determining some of the pragmatic applications.

rhettinger · September 18, 2019, 2:35am

IIRC, previous efforts stalled to due lack of real interest and need. Once we had frozenset(), it was obvious to think of frozendict(), but it didn’t seem to be essential enough for anyone to push it through. There are some implementations on PyPI but they don’t seem to have gained traction. Even frozenset() is in the rarely used category.

alexchamberlain · September 18, 2019, 8:36pm

I think this is a great idea; can’t wait to start using it.

I’ve got a couple of questions…

A persistent data structure is defined as a data structure that
preserves the previous version of the data when the data is modified.

Does it give you access to the previous versions explicitly, or is this implicit by holding onto references to the previous version?

frozenmap(collection) … an object with an items() method that is expected to return
a series of key/value tuples

What is the use case for this? I think frozenmap(collection.items()) is sufficient here, though I think dict should be supported explicitly.

On the including API, I tend to agree with others that the use cases seem to be covered by union. Would it be possible and more efficient to provide an API that excluded some keys, whilst adding others in a single operation/version/“transaction”? Something like

m = frozenmap(foo=1, bar=100)
m2 = m.change(include={"baz": 42}, exclude=["foo"])
assert m2 == frozenmap(baz=42, bar=100)

On FrozenMapCopy, the description covers the API mechanics well, but I don’t understand the motivation behind .close/context manager - what’s the use case for this?

aeros · September 19, 2019, 12:50am

Hmm, if even frozenset hasn’t seen that much usage across the user base and the more standard implementations of frozendict on PyPI haven’t seen much usage, I can see why you might be hesitant about the amount of usage that frozenmap will receive. To get a rough idea for user demand, I decided to start looking into the download statistics via pypistats.org.

The package immutables has had a significant number of downloads since it’s creation: PyPI Download Stats. The recent popularity might be slightly inflated due to the PEP proposal, but the daily downloads from April 2019 to now have sustained in the several thousands so there does seem to be some demand for this.

For comparison, the package frozendict that simply implements a frozen dictionary without any of the other changes has similar popularity: https://pypistats.org/packages/frozendict. The other packages on PyPI were significantly less popular, so I think this is the main implementation for a standard frozendict.

Of course, the download metrics alone are not an entirely reliable indicator of a package’s usage. They would be significantly more useful if downloads from unique IP addresses were tracked separately. But either way, they’re still useful for a gathering a rough approximation of user demand.

dimaqq · September 19, 2019, 1:19am

On the PEP itself

The PEP text doesn’t appear to clearly state if it’s the keys or the values or both that cannot be changed. Please make that clear.

On order

It took a long while to get dicts that remember the key order, I feel this proposal is incomplete if order is lost.

On naming

My gut tells me it should be frozendict.

.including() seems weird, in a sense that it’s not the word I’d naturally call it.
attrs uses .evolve() which is also an odd name.
dataclasses.replace() is cumbersome.
perhaps the most common operation should be called .update(kwarg1=..., kwarg2=...)?
or binary operations could be used, like d1 + d2 and d1 + dict(a=42) and d1 - ("name",)?

On CoW mutable API

The idea is great, but API needs polish.
Contrast with immer.js (adapted to fake Python syntax): state = immer.produce(state, lambda draft: { draft["x"] = 2*state["y"]; }) which of course won’t work as lambdas cannot have statements…

The idea of a context manager is really nice, and perhaps it could be more concise:

earth = forzendict(radius=1e7, ...)
with earth as moon:
    moon.diameter = 1.7e6
print(moon)

pf_moore · September 19, 2019, 11:16am

My understanding is that the proposal is for a mapping type that is better known in the functional programming community (Haskell, et al) where immutability is the norm. And in those contexts, I believe it is a useful data structure. As such, this seems like something that would be entirely reasonable as a PyPI module - whether it would get high usage or remain niche is less relevant in that context.

However, in this case, the implementation is already almost entirely present in core Python because it is used in the implementation of an existing feature (context variables). So that changes the balance somewhat - publishing on PyPI would mean duplicating (and keeping in sync with) the core implementation, whereas exposing the type in the stdlib would be mostly just a case of layering an API on top of the existing implementation.

While adding a class to the stdlib “because we can” is not a good justification, the fact that the implementation already exists does, in my view, lower the bar for how much benefit needs to be proved in advance. It should also be noted that context variables are (apparently ) an existing use case - albeit one that we’ve now provided so that users don’t need to reimplement it.

Maybe a useful approach would be to expose the type as an experimental feature, with the API (the part of the implementation that has not yet been proven) subject to change based on usage and experience?

I will say, however, that I’ve never yet needed a frozen dictionary in my code. So any use I can (currently) imagine would just be experimenting with a novel and potentially useful data type.

Paddy3118 · September 19, 2019, 1:34pm

But invaluable when you want a set of sets, or sets in dict keys. Here’s a use: http://rosettacode.org/wiki/Peaceful_chess_queen_armies#Python

cben · September 19, 2019, 5:15pm

@yselivanov the 2nd graph pep-0550-lookup_hamt.png is broken in https://www.python.org/dev/peps/pep-0603/.
Both graphs are broken in your post here, as are many code blocks, especially with numbers.mutating() ...
(maybe pandoc -f rst -t markdown can help)

cben · September 19, 2019, 5:46pm

Love the direction! Not sure this is the ideal answer, but happy it’s getting discussion.

I think the PEP could better explain its motivation by the following simple question:

If a mapping type is immutable, how does one actually populate it?

For tuples & frozensets, a straightforward & efficient answer is frequently “from an iterator”;
For mappings, it’s very common to need more complicated logic, frequently reading back previous value, e.g.:

d.setdefault(x, {}).setdefault(y, 0)
d[x][y] += 1

The PEP generally needs more worked scenarios of populating the mapping, specifically covering cases that proved annoying with dict — nested dictionaries, defaultdict.
I think the proposed interface is missing a .setdefault() equivalent — copy and add a k/v pair iff the key is not there yet?

Of course, one could populate a mutable dict, then freeze it. This is what all “just a frozen dict” implemetations force you to do. This requires O(N) extra work to make a frozen copy.
Wait, that’s actually asymptotically faster, compared to N calls to including() each taking O(log N)!
But copying a dict becomes much slower AND memory-hungry when you actually need persistence — when you populate, take a snapshot, populate some more, take a snapshot… Use cases:

You’re working with conceptually mutable state, but want to retain frequent snapshots.
Backtracking — passing down modified state, returning to current state after every call. (Example: chess move search)
You’re building/changing the mapping by a “pipeline” of small steps, and want to express each as pure function, without mutation.

It’s more about convenience than performance?

Some parts of the interface, notably “Calling the union() once to add/replace N keys” and .mutating(), feel like optimizations.

However, FrozenMapCopy get/set is still O(log N) which means it’s not strictly optimal, even “when the frozenmap in question contains thousands of key/value pairs and there’s a need to update many of them” => when doing sufficiently heavy processing, it’s asymptotically possible for conversion to dict and back to beat FrozenMapCopy (?)

Is mutating a FrozenMapCopy M times m[k] = v any faster than doing m = m.including(k, v) M times? Please clarify this in the PEP.
Please document the complexity of constructing from a large mapping. Is it O(N · log N) or better?

However, as the PEP says, the log(N) is nearly constant in practice, and the exact performance misses the point of immutable data structures.
The point is convenience/elegance and safety, with good enough performance. (Of all operations, including copy. Under this view, it’s actually dict and list that are weird, they’re optimized for fast access/mutation at expense of copy taking prohibitive O(N), which warped coding style so deeply that we imperative programmers no longer notice it ).

The point of .mutating() is nicely exposing the duality between immutable and mutable coding style, while sharing the same good-enough persistent representation underneath.

Memory is more compelling than speed

This is actually an aspect of performance that easily beats dict — with many snapshots much of the memory can be shared (similar to how Git stores many commits densely)

Can you add some benchmark showing memory usage, with sharing?
Ideally compare it also to a deep chain of ChainMap, which is current “poor man’s persistent map”.

Could `dict` benefit from `.including(k, v)`, `.excluding(k)` methods?

Part of the gain in this PEP is introducing easy “copy with changes” methods that were so far lacking on dict! Maybe those are worth adding?

For comparison, Ruby hashes are very similar to Python dicts but expose a rich set of non-mutating methods like h.merge(k1 => v1, k2 => v2), h.slice(k1, k2), h.select { ... }, h.reject { ... } etc.
(most of which have in-place mutating versions .merge!, .reject! etc.)
Code like options = options.merge(k => v) is extremely common in Ruby to avoid mutating your arguments.

[Strawman: Python collections did historically shy away from exposing operations they can’t implement efficiently. But most collections do expose some expensive but handy operations: lst[100:200] and [1] + lst + [2] for lists; dict(d, k1=v1, k2=v2) or dict(d1, **d2) for merging dict as long as keys are strings; most set operations — and all these are widely used. Also dict.keys() / .values() / .items() were expensive in python 2 and nobody cared.]