I want to package a python program using multiprocessing as zipapp. To access static files I use importlib.resources.files to create a Traversable (pathlib.Path with normal execution, zipfile.Path in zipapp) at startup. This Traversable then gets used in many different places.
Is this something I as an application developer have to deal with? What if I make a library that uses importlib. Is there any warning anywhere that using the Traversable returned by importlib.resources.files after a call to os.fork() is not save?
I personally think that this is unexpected behaviour and could be considered a bug, but I’m not sure.
The simplest thing is just to call files() after forking, or at need. It costs 2ms over the 20ms that just loading files took:
$ hyperfine -w5 "python -c pass"
Benchmark 1: python -c pass
Time (mean ± σ): 11.3 ms ± 1.7 ms [User: 7.7 ms, System: 3.4 ms]
Range (min … max): 7.5 ms … 16.7 ms 273 runs
$ hyperfine -w5 "python -c 'from importlib.resources import files'"
Benchmark 1: python -c 'from importlib.resources import files'
Time (mean ± σ): 30.4 ms ± 5.0 ms [User: 24.4 ms, System: 5.7 ms]
Range (min … max): 24.2 ms … 53.5 ms 100 runs
$ hyperfine -w5 "python -c 'from importlib.resources import files; files(\"json\")'"
Benchmark 1: python -c 'from importlib.resources import files; files("json")'
Time (mean ± σ): 32.6 ms ± 3.9 ms [User: 26.1 ms, System: 6.1 ms]
Range (min … max): 26.4 ms … 47.6 ms 89 runs
Personally, I like to have a variable that’s tied to the module, so I can load resources relative to a module without passing around the name, which I assume is why you have PATH = files(...) at the module level, rather than inline.
I found myself writing a wrapper over and over to sit in a data/__init__.py file, and ultimately ended up extracting it into a small package: acres.
You use it like:
src/pkg/data/__init__.py:
from acres import Loader
load = Loader()
Elsewhere:
from pkg import data
def my_function():
for resource in data.load.readable().iterdir():
...
# Or, to splat to the filesystem
with data.load.as_path() as data_dir:
for resource in data_dir.iterdir():
...
# Or, splat to filesystem on first use and cache
for resource in data.load().iterdir():
...
While you can create a module-level traversable (PATH = data.load.readable()) and end up back in this situation, this inline call is more natural to me. I didn’t even know I needed to avoid sharing a ZipFile across forked processes, and my code isn’t vulnerable to this.
I like having the shared Traversable value as the module level PATH constant. Passing that around and calling joinpath on it makes it fun to use. If I would create my own wrapper it would implement Traversable and look something like the following:
import os
import pathlib
from collections.abc import Iterator
from importlib.resources import files
from importlib.resources.abc import Traversable
from types import ModuleType
class LazyTraversable(Traversable):
"""
A lazy traversable.
Use LazyTraversable.make(__name__) to create.
"""
__slots__ = ("_anchor", "_path_segments")
_anchor: str | ModuleType
_path_segments: tuple[str | os.PathLike[str], ...]
def __init__(
self, anchor: str | ModuleType, *segments: str | os.PathLike[str]
) -> None:
"""Initialize self."""
self._anchor = anchor
self._path_segments = segments
@classmethod
def make(cls, anchor: str | ModuleType) -> Traversable:
"""
Create a new Traversable.
Multiprocesing-safe alternative to `importlib.resources.files`.
"""
if isinstance(traversable := files(anchor), pathlib.Path):
return traversable
return cls(anchor)
def _traversable(self) -> Traversable:
return files(self._anchor).joinpath(*self._path_segments)
def iterdir(self) -> Iterator[Traversable]:
"""Yield Traversable objects in self."""
return self._traversable().iterdir()
def is_dir(self) -> bool:
"""Return True if self is a directory."""
return self._traversable().is_dir()
def is_file(self) -> bool:
"""Return True if self is a file."""
return self._traversable().is_file()
def joinpath(self, *descendants: str | os.PathLike[str]) -> Traversable:
"""
Return Traversable resolved with any descendants applied.
Each descendant should be a path segment relative to self
and each may contain multiple levels separated by
``posixpath.sep`` (``/``).
"""
return LazyTraversable(self._anchor, *self._path_segments, *descendants)
def open(self, mode="r", *args, **kwargs):
"""
Open the file.
mode may be 'r' or 'rb' to open as text or binary. Return a handle
suitable for reading (same as pathlib.Path.open).
When opening as text, accepts encoding parameters such as those
accepted by io.TextIOWrapper.
"""
return self._traversable().open(mode, *args, **kwargs)
@property
def name(self) -> str:
"""The base name of this object without any parent references."""
return self._traversable().name
PATH = LazyTraversable.make(__name__)
Your benchmark is flawed as it’s just creating a pathlib.Path and not opening a ZipFile. So I think this wrapper could have a negative performance impact when running in a single-process zipapp.