Freezing data type for parallel access

TL;DR

Proposal for a parallel-read-access dictionary and potentially other data structures to complement free-threaded Python.

Background

Together with @Rostan, we have recently been experimenting with free-threaded Python. One of the exercises we conducted involved running parallel threads for data processing with NLTK (tokenization and lemmatization) and Pillow (image decoding). Interestingly, we did not achieve the performance gains we expected.

Our investigation showed that both of these extensions (and we believe this is common practice) maintain global state in a dictionary. As a result, when functions run in parallel, these global-state dictionaries are accessed concurrently using atomics, which slows down the entire process. Even though atomics are meant to preserve good performance, they are still slow under contention.

Idea

The conclusion above led us to an idea: along with removing the GIL from Python, a mechanism allowing data structures to be read in parallel without synchronization overhead might be useful.

Importantly, our idea goes beyond simple immutability - making an object immutable is already possible from the Python side. However, to achieve fast parallel reading, the backend (C-level) API needs to change.

So far, we have identified a handful of previously proposed solutions that address this problem to some extent:

  1. PEP 603 (frozenmap type) - Draft,
  2. PEP 416 (frozendict builtin type) - Rejected,
  3. PEP 351 (The freeze protocol) - Rejected,
  4. python-frozendict,
  5. immutabledict,
  6. MappingProxyType.

As mentioned earlier, existing solutions on PyPI focus on introducing immutable data types (e.g. immutabledict) on the Python level. This would likely be insufficient, as they use the underlying implementation of dict, whose C-level API does not allow for parallel reading. Furthermore, PEPs 416 and 351 were rejected in the past for good reasons at the time. However, since work on free-threading is ongoing, the situation has changed, and perhaps now would be a good moment to revisit these discussions. Lastly, PEP 603 proposes a frozenmap datatype, which actually aligns with our idea. The downside to the proposed approach is that any modification of a frozenmap requires copying the entire object.

Beyond the idea of frozenmap presented in PEP 603, we are also considering other (bolder) solutions, which are not restricted to dictionaries. The inspiration for those is PEP 351. Unlike frozenmap, the two proposed approaches do not require copying the whole object in order to modify it. The ideas are:

  1. Introducing obj.freeze()/obj.unfreeze()[1] methods for built-in data types. Upon calling freeze(), the interpreter would lock the object against modification and allow parallel read access.
  2. Introducing dunder methods __freeze__/__unfreeze__, which could correspond to keywords like freeze obj (similar to del obj).

We believe these approaches fit well in scenarios like the one described above (i.e. keeping global state in a dictionary - or any other data type). In such cases it is natural to have an initialization phase first, then freeze the data and only read from it afterwards, as opposed to overwriting or reassigning a global variable.
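The init-then-read pattern can be approximated today with types.MappingProxyType, which gives a read-only view at the Python level - though, unlike the proposed freeze(), it does nothing to speed up concurrent C-level reads. A minimal sketch (the state contents here are made up for illustration):

```python
from types import MappingProxyType

# Initialization phase: single-threaded writes to a private dict.
_state = {"lemma_cache": {"running": "run"}}

# Publish a read-only view; readers cannot mutate through it.
STATE = MappingProxyType(_state)

assert STATE["lemma_cache"]["running"] == "run"
try:
    STATE["x"] = 1          # any write through the proxy raises
except TypeError:
    pass
```

Note that the proxy is only a view: code holding a reference to the underlying `_state` can still mutate it, and the dict's C-level access path is unchanged.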

How do you feel about the idea of having a parallel-read-access dictionary? Would it be a good idea to expand this to other (perhaps all) built-in data types? Or do you think we should explore an entirely different approach to solving parallel access issues?


  1. We acknowledge that there is some controversy around calling this operation freeze because—depending on the implementation—it may actually involve copying. Since our proposal involves no copy, we are keeping this name. Naturally, we are not tightly attached to it and the final name would be decided at a later stage. ↩︎

13 Likes

Adding new APIs for free threading probably won’t work right now, since free-threading isn’t the default build – wouldn’t freezing objects on builds with the GIL (nearly all of Python!) slow things down on those?

Maybe, if the problem is reference count contention, we could add an API for enabling deferred reference counting on user-defined objects (pretty much just publicizing _PyObject_SetDeferredRefcount, and subsequently adding a Python API in gc) – that would be much more portable, and wouldn’t affect performance on the default builds.

1 Like

I’d love to see more detail about how you set your benchmark up and how you diagnosed where the slowdown came from.

Adding new APIs for free threading probably won’t work right now, since free-threading isn’t the default build – wouldn’t freezing objects on builds with the GIL (nearly all of Python!) slow things down on those?

This proposal doesn’t only benefit free-threaded Python. While it’s true that we care about performance considerations in the context of free-threaded Python, immutability has other benefits and makes code easier to reason about. I’m not sure why allowing the possibility to freeze objects would hurt performance on regular builds.

Maybe, if the problem is reference count contention, we could add an API for enabling deferred reference counting on user-defined objects (pretty much just publicizing _PyObject_SetDeferredRefcount , and subsequently adding a Python API in gc ) – that would be much more portable, and wouldn’t affect performance on the default builds.

About exposing deferred rc for user-defined objects: while this would help mitigate the contention issue, I suspect it would actually be used by only a small number of Python users who (i) know about deferred rc and (ii) care that contention on state frequently accessed by multiple threads can hurt performance.

On the other hand, encouraging immutability has advantages in and of itself and has the nice consequence of addressing the contention problem.

1 Like

You may also be interested in the PyCon talk by Yury Selivanov (of PEP 603, and using the same data structure) “Overcoming GIL with subinterpreters and immutability”.

4 Likes

I’d love to see more detail about how you set your benchmark up and how you diagnosed where the slowdown came from.

The way we initially noticed there was an issue was by running nltk in both thread pools and process pools and observing better throughput with process pools. To confirm the root cause, we wrote a script that does basically nothing apart from accessing a dict in parallel.

Script
#!/usr/bin/env python3

import time
from concurrent.futures import ThreadPoolExecutor

import matplotlib.pyplot as plt

N = 1_000_000


def apply_map(mapper: dict[int, int]):
    for num in range(N):
        if num in mapper:
            mapper[num]  # read access only; result discarded


def main():
    identity = {num: num for num in range(N)}  # int keys, matching apply_map's int lookups

    latencies = []
    all_threads = range(1, 16)

    for nb_threads in all_threads:
        print(f"Number of threads: {nb_threads}")
        start = time.monotonic()
        with ThreadPoolExecutor(max_workers=nb_threads) as executor:
            for _ in range(nb_threads):
                executor.submit(apply_map, identity)
        end = time.monotonic()

        print(end - start)
        latencies.append(end - start)

    plt.bar(all_threads, latencies)
    plt.show()


if __name__ == "__main__":
    main()

Here are the results I got on my machine:

Configuration
$ python -VV
Python 3.13.0rc1+ experimental free-threading build (heads/3.13-dirty:5709cee0b6, Aug 24 2024, 22:17:17) [GCC 11.4.0]
$ nproc --all
32
2 Likes

I lamented the rejection of 351 a bit at the time, but maybe it was just 20 years too early. We could retroactively accept it, and it wouldn’t be the first time for one of my PEPs. :stuck_out_tongue_winking_eye:

One thing to keep in mind, and it was an open question for 351, was whether we need an unfreeze protocol. I think the use case for a freeze protocol is more compelling, especially now with free threading on the horizon, but I’m not so sure about unfreeze.

Two other potentially related topics are PEP 509, which added a version counter to the dict implementation, and PEP 699, which rescinded that feature. While ostensibly added as a guard for optimization, I could imagine that a version field could at least be used to determine whether any changes were made to a dictionary between reads. It wouldn’t in and of itself solve the concurrency problem, but it could come in handy, although I don’t quite see just how.

8 Likes

As another modern use case for freezing objects that has nothing to do with threading: every time I use defaultdict I wish I could lock it after populating it, rather than having to copy the contents out to a new regular dict.
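The defaultdict pain point above can be shown concretely. Today the only way to "lock" one is to copy its contents into a plain dict so that missing-key lookups raise instead of silently inserting entries (a hypothetical in-place freeze would make the copy unnecessary):

```python
from collections import defaultdict

counts = defaultdict(int)
for word in ["a", "b", "a"]:
    counts[word] += 1

# counts["missing"] here would silently insert 0 -- often unwanted
# after the population phase.  Today's workaround is a full copy:
locked = dict(counts)

assert locked == {"a": 2, "b": 1}
try:
    locked["missing"]       # now raises instead of auto-inserting
except KeyError:
    pass
```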

For unfreezing, I think we already have a protocol: the copy protocol. Unfreezing in-place in the presence of free threading seems… perilous.

12 Likes

Another modern use case for freezing is safe narrowing for containers. If you type-check that a container contains only integers before storing it as a member variable, you may convince the type checker that it’s Set[int], but the caller is free to put whatever it wants into the set, which breaks your assumption. If, on the other hand, you freeze the container after checking, then it is definitely Set[int] as desired.
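frozenset already demonstrates this narrowing guarantee for sets; a freeze protocol would extend it to other containers. A small sketch (the function name is illustrative):

```python
def narrow_to_int_set(items: list[object]) -> frozenset[int]:
    checked = {x for x in items if isinstance(x, int)}
    # Freezing after the check keeps the narrowing valid: no caller
    # can later add a non-int and break the Set[int] assumption.
    return frozenset(checked)

s = narrow_to_int_set([1, 2, "x", 3])
assert s == {1, 2, 3}
```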

10 Likes

Point of order: it seems like there at least isn’t overwhelming opposition to resurrecting a way to freeze objects. I’d be happy to re-open PEP 351 for reconsideration. I’m not sure we’d need a PEP to supersede it. 351 is pretty simple and probably about all we need.

+1

10 Likes

Hmm, I think a new, modernized, PEP would make more sense. PEP 351 should remain for historical purposes, and then the new one should link to that with an updated motivation and rationale mentioning free-threading.

3 Likes

While the protocol proposal could potentially be the same, I think we do need a new PEP to explain what has changed since PEP 351 was rejected. (The python-dev post linked from the PEP describes some still-legitimate concerns, but the utility of MemHive’s HAMT mappings also provides a counterexample to the dict case.)

It isn’t the same situation as when older PEPs were deferred or withdrawn by their authors.

In addition to the open questions still in PEP 351, a modern PEP would also need to discuss the interaction with type checkers.

Both the defaultdict use case and the more general “locking a mutable container after populating it” use case would also benefit from the “freeze in place” version of the protocol, which makes mutation methods start throwing exceptions rather than removing them entirely.

I also think the relationship to the copy module needs to be considered further:

  • if the base protocol freezes in-place (perhaps even calling the method __ifreeze__ rather than __freeze__), then freeze and its protocol method should be mutating operations that return None
  • by analogy with sort vs sorted, the make-an-immutable-copy version would be frozen. If the relevant methods are defined, that could use the copy protocol with in-place freezing, but __frozen__ could also be implemented directly (either for efficiency or because in-place freezing isn’t supported)
  • the shallow freezing vs deep freezing question needs to be considered (I would suggest shallow freezing by default, with deep_freeze and deeply_frozen as recursive versions)
  • rather than builtins or a new module, this could all be added to the copy module to emphasise the relationship between the protocols, especially if thawed variants are added (I realised that using the copy protocol for thawing doesn’t work when the mutable and immutable versions of a container are genuinely different types). Thawing should always produce a new object, though (due to the concurrency problems with thawing in-place)
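A toy sketch of the in-place vs. copy-based split described above (the `__freeze__`/`__frozen__` names are this thread's proposals, not an existing API, and `FreezableDict` is purely illustrative - a real protocol would also cover update(), __delitem__, etc.):

```python
class FreezableDict(dict):
    """Illustrative only: dict whose mutators can be disabled in place."""
    _frozen = False

    def __freeze__(self) -> None:
        # In-place variant: a mutating operation returning None.
        self._frozen = True

    def __setitem__(self, key, value):
        if self._frozen:
            raise TypeError("cannot modify a frozen dict")
        super().__setitem__(key, value)

    def __frozen__(self) -> "FreezableDict":
        # Copy-based variant, by analogy with sort vs sorted.
        out = FreezableDict(self)
        out.__freeze__()
        return out

d = FreezableDict(a=1)
snap = d.__frozen__()   # frozen snapshot; original stays mutable
d["b"] = 2
try:
    snap["b"] = 2       # mutators on the snapshot now raise
except TypeError:
    pass
```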

Finally, I think the PEP 603 thread at PEP 603: Adding a frozenmap type to collections would be worth reviewing for potential ideas (especially around temporarily thawing objects to more efficiently produce derived versions).

3 Likes

Similar to the free-threading use case, I’m wondering if we should keep in mind (isolated) subinterpreters. The deferred PEP 734 supports sharing/sending data across interpreters but only for a limited set of basic immutable types; could some version of a freezing protocol work well for “send an immutable copy of this object to the other interpreters”?

3 Likes

If this is being resurrected, I’ll add my support for a freeze protocol. I’m against unfreezing being possible, except potentially in CPython internals, with an exposed API for efficiently making a mutable copy or derivative that still upholds what people expect from something frozen. If unfreezing were possible, many of the benefits for static analysis and concurrent use would be lost: one would then need to check that an object hasn’t been unfrozen since, and some means of tracking when it was last frozen/unfrozen would have to be attached to it.

4 Likes

Deep freezing seems more sane and should perhaps be preferred.

One of the arguments for the rejection of PEP 351 is that immutable dicts can’t serve as keys as their values can be mutated. This is a valid concern for dicts but also for any immutable object with mutable parts. Deep freezing should help mitigate such issues. More generally, making it possible to mutate parts of a supposedly immutable object seems to be a bad thing.

Another argument in the rejection notice of PEP 351 is the Liskov violation introduced by viewing tuple as immutable lists. I think that it is a direct consequence of the fact that __freeze__ here is not in-place and returns another object. Even if care is taken to make the API of the immutable class compatible with that of the mutable one, this is error prone as the two need to stay synchronized over time.

I think that this alone is a good reason to favor in-place freezing. There are other reasons such as making the “locking a mutable container after populating it" use case easier to express.

1 Like

The static typing folks actually conceded that point a while back: while tuple[A] indicates a 1-tuple, and tuple[A, B, C] indicates a heterogeneous 3-tuple, tuple[A, ...] represents an arbitrary-length tuple of A instances (specifically corresponding to an immutable equivalent of list[A]).

It’s one of the things that a new PEP should note as changing since PEP 351 was rejected: rather than continuing to argue that using a tuple as a frozenlist equivalent is incorrect and should be avoided, it has been explicitly acknowledged that it is a valid approach (with no obvious alternatives), and hence it is OK if the typeshed stub for list.__frozen__ ends up looking like:

def __frozen__(self) -> tuple[T, ...]:
    ...

Yes, definitely. I was going to note that was the topic of Yury’s talk linked earlier in the thread, and then I noticed that was your link :slight_smile:

The MemHive performance numbers comparing the low cost of HAMT sharing with dict serialisation overhead are spectacular.

2 Likes

Thanks for the clarification!

My point however is not specifically about lists and tuples but frozen types that have an incompatible API with their unfrozen counterparts. It might be undesirable to have A.__frozen__ returning an object of class B if A has methods that B doesn’t (or worse, A.foo and B.foo don’t exactly do the same things).

Even for the case of lists/tuples, code that assumes a compatible API between the mutable and immutable objects might not work.

>>> x = [1, 2, 3]
>>> x + [4]
[1, 2, 3, 4]
>>> y = frozen(x)
>>> y + [4]
TypeError: ...

Again, I use lists and tuples here but my point is only that it can be hard to have (and maintain) a fully compatible interface between the mutable type and the immutable one.

1 Like

Should objects returned by __freeze__ use deferred reference counting, or perhaps be immortal? That should significantly help performance on the free-threaded builds.

I’m not sure about the downside: it’s based on HAMT, which is very efficient for modifications.

2 Likes

Strong -1 from me.

Issues with the proposal

The analogy between del thing and freeze thing is false: del acts on the name/reference (or a reference expression like del arr[0]), while the proposed freeze acts on the object itself.

Suppose the OP’s use case were written with freeze/unfreeze: N threads collaborate on a task, but each individual thread may unfreeze the object. This means external coordination between threads is required, and only the last thread should unfreeze. While that could be implemented correctly, the code could also be non-trivial and contain side effects. What if there’s an (N+1)-th thread that uses the same library but performs an unrelated task? Who gets to freeze and unfreeze then? What if another thread runs unittest.mock.patch?

Conceptual issues

I feel that given the specific performance issue in the OP, the proposed solution would operate at the wrong level of abstraction.

If the global dictionary is a private data container, threads ought to make private copies on entry. The best existing mechanism is contextvars.copy_context; a custom dict or dataclass copy could likewise be implemented by the specific library.
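A minimal sketch of the private-copies-on-entry approach via contextvars.copy_context (the worker and state names are made up for illustration):

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

state = contextvars.ContextVar("state")
state.set({"running": "run"})   # populated once, in the main thread

def worker() -> int:
    local = dict(state.get())   # explicit private copy on entry
    local["done"] = True        # mutations stay thread-local
    return len(local)

# Worker threads don't inherit the main thread's context, so we
# capture it and run each task in its own copy.
ctx = contextvars.copy_context()
with ThreadPoolExecutor(max_workers=2) as ex:
    futs = [ex.submit(ctx.copy().run, worker) for _ in range(2)]
    results = [f.result() for f in futs]

assert results == [2, 2]
assert state.get() == {"running": "run"}   # shared state untouched
```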

If the global dict is the module namespace, I would first question whether the problem in fact translates to generic use cases - that is, whether a hypothetical fast-read-slow-write dict implementation holding the module namespace would in fact make general-purpose code significantly faster.

I would recommend prototyping with a custom build where contention is e.g. 2x slower (faster and correct is hard; slower and correct is trivial) and running a wide set of benchmarks. Once again, real code like web apps, scikit, tf, sqlalchemy, etc., and not micro-benchmarks.

All that being said, I would expect that module dict and instance dict contention to be valid targets for optimisation as part of free-threaded work.

Outline of a solution

The best performer would probably be RCU; however, I suspect that userland-only RCU may not be possible, and if we could get kernel help, I’m not sure what that would look like across all platforms. The C API and/or ABI stability is likely to be a mess. Maybe, just maybe, free-threaded Python could do something about the grace period or quiescent state.

CoW dict: I’m not sure about the performance trade-off in general code. Copies are not immortal, we’d be trading access contention for gc contention.

Java-style (sharded) concurrent hashmap?

A lock-free skip list perhaps?

A bog standard Read-Write Lock / shared_mutex or maybe borrow from PyPy?

P.S. I love immutable-types discussions as much as anyone, and am a bit bummed out by the current state of Python containers wrt. immutability, but that’s not the solution to what the OP observed. Let’s not conflate the two lest we never get anywhere with either.

2 Likes