importlib.resources.files and multiprocessing

I want to package a Python program that uses multiprocessing as a zipapp. To access static files, I use importlib.resources.files at startup to create a Traversable (a pathlib.Path under normal execution, a zipfile.Path in a zipapp). This Traversable then gets used in many different places.

Sadly, zipfile.Path breaks when multiple processes read at the same time and can raise zipfile.BadZipFile exceptions. This is because it wraps a zipfile.ZipFile object, which is not expected to work across processes (see: zipfile with multiprocessing: zipfile.BadZipFile · Issue #83544 · python/cpython · GitHub).

I made a simple reproduction of the behaviour: GitHub - Joshix-1/importlib.resources.files-multiprocessing--bug · GitHub

It’s kinda similar to the bug fixed in github com/python/cpython/pull/146231. Doing something similar works, but I don’t like that as a solution:

diff --git a/Lib/zipfile/_path/__init__.py b/Lib/zipfile/_path/__init__.py
index 80f5d60773..39a18b1c09 100644
--- a/Lib/zipfile/_path/__init__.py
+++ b/Lib/zipfile/_path/__init__.py
@@ -319,6 +319,16 @@ def __init__(self, root, at=""):
         self.root = FastLookup.make(root)
         self.at = at

+        import os
+
+        def noop(**kwargs): pass
+
+        def fix():
+            print("fix")
+            self.root = FastLookup.make(root)
+
+        getattr(os, 'register_at_fork', noop)(after_in_child=fix)
+
     def __eq__(self, other):
         """
         >>> Path(zipfile.ZipFile(io.BytesIO(), 'w')) == 'foo'

Is this something I, as an application developer, have to deal with? What if I write a library that uses importlib.resources? Is there any warning anywhere that using the Traversable returned by importlib.resources.files after a call to os.fork() is not safe?

I personally think that this is unexpected behaviour and could be considered a bug, but I’m not sure.

The simplest thing is just to call files() after forking, or at need. It costs 2 ms over the ~20 ms that just importing files took:

$ hyperfine -w5 "python -c pass"                                                                            
Benchmark 1: python -c pass
  Time (mean ± σ):      11.3 ms ±   1.7 ms    [User: 7.7 ms, System: 3.4 ms]
  Range (min … max):     7.5 ms …  16.7 ms    273 runs

$ hyperfine -w5 "python -c 'from importlib.resources import files'"                                         
Benchmark 1: python -c 'from importlib.resources import files'
  Time (mean ± σ):      30.4 ms ±   5.0 ms    [User: 24.4 ms, System: 5.7 ms]
  Range (min … max):    24.2 ms …  53.5 ms    100 runs
 
$ hyperfine -w5 "python -c 'from importlib.resources import files; files(\"json\")'" 
Benchmark 1: python -c 'from importlib.resources import files; files("json")'
  Time (mean ± σ):      32.6 ms ±   3.9 ms    [User: 26.1 ms, System: 6.1 ms]
  Range (min … max):    26.4 ms …  47.6 ms    89 runs
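The “at need” variant can be as small as a helper that re-resolves the anchor on every call. A sketch (the "json" anchor is just a stand-in for your own package):

```python
from importlib.resources import files

def data():
    # Resolving the anchor per call means each process builds its own
    # Traversable, so no zipfile.Path ever crosses a fork.
    return files("json")  # stand-in anchor; use your package name

# Usage looks the same as with a module-level constant:
is_pkg_dir = data().is_dir()
```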

Personally, I like to have a variable that’s tied to the module, so I can load resources relative to a module without passing the name around, which I assume is why you have PATH = files(...) at the module level rather than inline.

I found myself writing a wrapper over and over to sit in a data/__init__.py file, and ultimately ended up extracting it into a small package: acres.

You use it like:

src/pkg/data/__init__.py:

from acres import Loader

load = Loader()

Elsewhere:

from pkg import data

def my_function():
    for resource in data.load.readable().iterdir():
        ...

    # Or, to splat to the filesystem

    with data.load.as_path() as data_dir:
        for resource in data_dir.iterdir():
            ...

    # Or, splat to filesystem on first use and cache

    for resource in data.load().iterdir():
        ...

While you can create a module-level traversable (PATH = data.load.readable()) and end up back in this situation, this inline call is more natural to me. I didn’t even know I needed to avoid sharing a ZipFile across forked processes, and my code isn’t vulnerable to this.

I like having the shared Traversable value as the module-level PATH constant. Passing that around and calling joinpath on it makes it fun to use. If I were to create my own wrapper, it would implement Traversable and look something like the following:

import os
import pathlib
from collections.abc import Iterator
from importlib.resources import files
from importlib.resources.abc import Traversable
from types import ModuleType


class LazyTraversable(Traversable):
    """
    A lazy traversable.

    Use LazyTraversable.make(__name__) to create.
    """

    __slots__ = ("_anchor", "_path_segments")

    _anchor: str | ModuleType
    _path_segments: tuple[str | os.PathLike[str], ...]

    def __init__(
        self, anchor: str | ModuleType, *segments: str | os.PathLike[str]
    ) -> None:
        """Initialize self."""
        self._anchor = anchor
        self._path_segments = segments

    @classmethod
    def make(cls, anchor: str | ModuleType) -> Traversable:
        """
        Create a new Traversable.

        Multiprocessing-safe alternative to `importlib.resources.files`.
        """
        if isinstance(traversable := files(anchor), pathlib.Path):
            return traversable
        return cls(anchor)

    def _traversable(self) -> Traversable:
        return files(self._anchor).joinpath(*self._path_segments)

    def iterdir(self) -> Iterator[Traversable]:
        """Yield Traversable objects in self."""
        return self._traversable().iterdir()

    def is_dir(self) -> bool:
        """Return True if self is a directory."""
        return self._traversable().is_dir()

    def is_file(self) -> bool:
        """Return True if self is a file."""
        return self._traversable().is_file()

    def joinpath(self, *descendants: str | os.PathLike[str]) -> Traversable:
        """
        Return Traversable resolved with any descendants applied.

        Each descendant should be a path segment relative to self
        and each may contain multiple levels separated by
        ``posixpath.sep`` (``/``).
        """
        return LazyTraversable(self._anchor, *self._path_segments, *descendants)

    def open(self, mode="r", *args, **kwargs):
        """
        Open the file.

        mode may be 'r' or 'rb' to open as text or binary. Return a handle
        suitable for reading (same as pathlib.Path.open).

        When opening as text, accepts encoding parameters such as those
        accepted by io.TextIOWrapper.
        """
        return self._traversable().open(mode, *args, **kwargs)

    @property
    def name(self) -> str:
        """The base name of this object without any parent references."""
        return self._traversable().name


PATH = LazyTraversable.make(__name__)

Your benchmark is flawed: it only creates a pathlib.Path and never opens a ZipFile. So I think this wrapper could have a negative performance impact when running in a single-process zipapp.
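For reference, on a normal (non-zipapp) install the call just yields a plain pathlib.Path, so no archive is opened at all (again using the stdlib "json" package as a stand-in):

```python
import pathlib
from importlib.resources import files

p = files("json")
# Under normal execution this is a plain pathlib.Path; only inside a
# zipapp does files() have to open the archive, which is the cost the
# benchmark above doesn't measure.
is_plain_path = isinstance(p, pathlib.Path)
```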

I made a library that fixes it for me, but I’m not really happy with it.

Can’t you redesign your code to create a Traversable per process? From what I understand, you are only reading, right?

Yes, I’m only reading. I’ve thought about redesigning my code, but it seems like too much work.

It’s a website using Tornado, and I’m creating all the request handlers using the Traversables before forking. If you are interested in the code: GitHub - asozialesnetzwerk/an-website: #1 Website in the Worlds · GitHub

It would imho be better if Python didn’t have that pitfall, but my library fixes it for me.