Tracking (and isolating) imports without sys.modules caching

Context in case this is an XY problem: I’m working on a static site generator that uses Jinja templates. I want to let the templates use Python code specific to the static site project. I also want incremental builds to be as fast and correct as possible, which means keeping track of which Python files each template imports so that a template can be rebuilt when a Python file it imports changes.

Actual question: I think the code below works to both track all project-local Python files that a Jinja template imports (via self._fs.add_dependency(module_path)) and to isolate those files (not for security) from the static site generator’s own code. But returning a module name of f"<ginjarator imported module: {fullname}>" seems like a huge hack that could easily break in the future. Is there a better way to do this? Or is it actually OK for a MetaPathFinder to return a spec with an invalid module name?

(Please ignore the TODO comments, those are things that I can figure out easily myself after I figure out the right general approach to take here.)

# SPDX-FileCopyrightText: 2025 David Mandelberg <david@mandelberg.org>
#
# SPDX-License-Identifier: Apache-2.0

import contextlib
import contextvars
import importlib.abc
import importlib.machinery
import importlib.util
import pathlib
import sys
import types
from collections.abc import Generator, Sequence
from typing import override  # Python 3.12+

# Project-local module (import path assumed from the repository layout):
from ginjarator import filesystem

_enabled_finder: contextvars.ContextVar["_MetaPathFinder"] = (
    contextvars.ContextVar("_enabled_finder")
)


class _MetaPathFinder(importlib.abc.MetaPathFinder):
    """Finder for local python modules in a Filesystem."""

    def __init__(
        self,
        *,
        fs: filesystem.Filesystem,
        path: pathlib.Path,
    ) -> None:
        """Initializer.

        Args:
            fs: Filesystem access.
            path: Where to look for project-local modules.
        """
        self._fs = fs
        self._path = path

    @override
    def find_spec(
        self,
        fullname: str,
        path: Sequence[str] | None,
        target: types.ModuleType | None = None,
    ) -> importlib.machinery.ModuleSpec | None:
        del path, target  # Unused.
        if _enabled_finder.get(None) is not self:
            return None
        module_path = self._fs.resolve(
            self._path / (fullname.replace(".", "/") + ".py")
        )
        if not module_path.exists():
            # TODO: add a weak dependency, in depfile not dyndep?
            # TODO: this is a race condition, use read_text instead
            return None
        self._fs.add_dependency(module_path)
        # sys.modules caching could prevent this code from tracking all
        # dependencies, and it could leak local modules from the project being
        # built into ginjarator itself. This returns a spec with a different
        # name than what was requested to disable caching, and uses an invalid
        # name so it doesn't affect any normal imports.
        fake_fullname = f"<ginjarator imported module: {fullname}>"
        return importlib.util.spec_from_loader(
            fake_fullname,
            importlib.machinery.SourceFileLoader(
                fake_fullname,
                str(module_path),
            ),
        )

    @contextlib.contextmanager
    def enable(self) -> Generator[None, None, None]:
        """Returns a context manager that temporarily enables this finder."""
        with contextlib.ExitStack() as stack:
            token = _enabled_finder.set(self)
            stack.callback(_enabled_finder.reset, token)
            sys.meta_path.append(self)  # TODO: before normal path?
            stack.callback(sys.meta_path.remove, self)
            yield

I think I would have tried to hijack __import__ instead. That allows you to track all actual import statements.

I tried overriding module.__import__ for only the relevant modules, but it didn’t seem to have any effect. I could override the global builtins.__import__ of course, but that seems like it would make it harder to separately track imports from different templates in different threads if I end up doing that[1]. Though I guess I could put the current template in a contextvar, and then read that contextvar in __import__.

It’s also inherently global so if any other framework needs to override __import__ for its own purposes, that would conflict. I don’t know if this is an issue in practice off the top of my head, but I think pytest uses a bunch of hacks to get good error messages for assertions and I wouldn’t be too surprised if it did something for __import__ too.


  1. I currently render a small number of templates in series for bootstrapping, then use ninja to run one python process per render after bootstrapping is done. If the bootstrap step ends up being too slow, I’ll look into parallelizing it within python, since using ninja there seems complicated. ↩︎

I don’t know exactly what kind of problems you are running into, but can’t you use the various methods to get the calling frame and deduce the module based on that? Otherwise contextvar is also a good solution. Both of these are less hacky than what you are currently doing.
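The calling-frame suggestion might look something like the sketch below. The helper and hook names are hypothetical, and the frame depth is the fragile part: it has to match how deeply the helper is nested inside the real hook.

```python
import sys

def calling_module_name(depth: int = 1) -> str:
    """Return the __name__ of the module `depth` frames up the call stack.

    Hypothetical helper: called from inside an import hook, it deduces which
    module triggered the call without any global registry.
    """
    frame = sys._getframe(depth + 1)  # +1 skips this helper's own frame
    return frame.f_globals.get("__name__", "<unknown>")

def some_hook() -> str:
    # Inside a real hook, this would identify the importing module.
    return calling_module_name()

caller = some_hook()  # typically "__main__" when run as a script
```

`sys._getframe` is CPython-specific; `inspect.stack()` is the portable (but slower) alternative.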

Composing __import__ wrappers shouldn’t be that difficult and they also are quite rare - I would be surprised if pytest uses one.
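A minimal sketch of how such wrappers compose: each wrapper captures whatever builtins.__import__ was at install time and delegates to it, so independently installed wrappers chain naturally. The names here are illustrative, not from any real framework.

```python
import builtins

import_log: list[str] = []
_original = builtins.__import__

def make_logging_import(tag: str):
    """Return an __import__ wrapper that logs the requested name, then
    delegates to whatever builtins.__import__ was when it was installed."""
    previous = builtins.__import__

    def wrapper(name, globals=None, locals=None, fromlist=(), level=0):
        import_log.append(f"{tag}:{name}")
        return previous(name, globals, locals, fromlist, level)

    return wrapper

builtins.__import__ = make_logging_import("outer")
builtins.__import__ = make_logging_import("inner")  # composes over "outer"
try:
    import json  # noqa: F401  # runs through inner, then outer, then the original
finally:
    builtins.__import__ = _original  # restore global state

assert import_log == ["inner:json", "outer:json"]
```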


Yes, you can use importlib and store imported modules in your own dictionary. Since import is just syntactic sugar for __import__(), you can define a custom importer() function to handle loading and tracking modules. Then, simply use your importer() function in place of regular import statements.

I tried out an implementation of __import__ and set both builtins.__import__ and importlib.__import__ to the wrapper. It seems to work fine for import statements, but I don’t think it’s being called at all by importlib.import_module? From cpython/Lib/importlib/__init__.py (commit 56eabea056ae1da49b55e0f794c9957008f8d20a, python/cpython on GitHub), it looks like that function uses a different import mechanism.

I looked through the importlib code a bit to see if there’s a way to modify its behavior, but I don’t think there’s anything simple. So I guess if I go this route with __import__, I’ll have to call that directly instead of using importlib.import_module? That’s probably a fine enough trade-off.
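The behavior described above can be confirmed with a throwaway spy wrapper (names here are illustrative, not project code): the import statement goes through builtins.__import__, while importlib.import_module bypasses it entirely.

```python
import builtins
import importlib

calls: list[str] = []
_original = builtins.__import__

def _spy(name, globals=None, locals=None, fromlist=(), level=0):
    calls.append(name)
    return _original(name, globals, locals, fromlist, level)

builtins.__import__ = _spy
try:
    import string  # noqa: F401  # the import statement calls builtins.__import__
    importlib.import_module("textwrap")  # ...but this goes around it
finally:
    builtins.__import__ = _original

assert "string" in calls
assert "textwrap" not in calls
```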

Then, simply use your importer() function in place of regular import statements.

That would require modifications to the code being imported, which would make it harder to use with normal static analysis and testing tools.

In that case, you’ll need to follow the proper import protocol. See Finders and loaders in the importlib documentation.

You may already be following this path; just sharing the link for completeness.

Yup, that’s what I used in the first post. The problem, as the import system reference (Python 3.13.5 documentation, §5) says, is that “Meta hooks are called at the start of import processing, before any other import processing has occurred, other than sys.modules cache look up.” That means that if one template imports a module and it gets cached in sys.modules, then when another template imports the same module my hook wouldn’t normally be called, so that dependency wouldn’t be tracked.
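The quoted behavior can be demonstrated with a counting finder (the module name and classes below are made up for illustration): the second import is satisfied from the sys.modules cache, so the finder is never consulted again.

```python
import importlib
import importlib.abc
import importlib.util
import sys

class CountingFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Serves one fake empty module and counts find_spec calls for it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.calls = 0

    def find_spec(self, fullname, path, target=None):
        if fullname != self.name:
            return None
        self.calls += 1
        return importlib.util.spec_from_loader(fullname, self)

    def create_module(self, spec):
        return None  # use the default module creation

    def exec_module(self, module):
        pass  # empty module body

finder = CountingFinder("demo_cached_module")
sys.meta_path.insert(0, finder)
try:
    importlib.import_module("demo_cached_module")
    importlib.import_module("demo_cached_module")  # sys.modules cache hit
finally:
    sys.meta_path.remove(finder)
    sys.modules.pop("demo_cached_module", None)

assert finder.calls == 1  # the second import never reached the finder
```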

You only need to track modules once and update them when their source (file, string, etc.) changes. So why should each template track its imported modules?

I did some more testing, and I haven’t been able to get overriding __import__ to track all imports. With the code and test below, the print(name) line shows ginjarator__python_test__test_api_module.module1 and ginjarator__python_test__test_api_module, but not module2 or module3.

At this point, I’m leaning towards just giving up on rendering multiple templates in the same Python process. If I give up on that, then I can just do the import normally and look at the spec origins in sys.modules afterwards.
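That fallback could be sketched roughly as follows (the helper name is made up): snapshot sys.modules, import normally, then collect the file origins of every newly loaded module. This only tracks dependencies correctly if each render starts with a clean sys.modules, i.e. one process per template.

```python
import importlib
import sys

def import_and_collect_origins(name: str) -> set[str]:
    """Import `name` normally, then report the file origins of every module
    that the import newly added to sys.modules."""
    before = set(sys.modules)
    importlib.import_module(name)
    origins = set()
    for new_name in set(sys.modules) - before:
        spec = getattr(sys.modules[new_name], "__spec__", None)
        # Skip built-in and frozen modules, which have no file on disk.
        if spec is not None and spec.origin not in (None, "built-in", "frozen"):
            origins.add(spec.origin)
    return origins

# Arbitrary stdlib module; empty result if it was already imported.
origins = import_and_collect_origins("wave")
```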

Or maybe at some point in the future I can use PEP 734 – Multiple Interpreters in the Stdlib | peps.python.org to use a different interpreter for each template.

# SPDX-FileCopyrightText: 2025 David Mandelberg <david@mandelberg.org>
#
# SPDX-License-Identifier: Apache-2.0
import builtins
import contextvars
import types
from collections.abc import Mapping, Sequence
from typing import Any

_imported_origins: contextvars.ContextVar[set[str]] = contextvars.ContextVar(
    "_imported_origins"
)

_original_import = builtins.__import__


def _import_wrapper(
    name: str,
    globals: (  # pylint: disable=redefined-builtin
        Mapping[str, Any] | None
    ) = None,
    locals: Any = None,  # pylint: disable=redefined-builtin
    fromlist: Sequence[str] = (),
    level: int = 0,
) -> types.ModuleType:
    print(name)  # TODO
    imported = _original_import(name, globals, locals, fromlist, level)
    # TODO: move `if imported_origins ...` condition here
    module = imported
    if not fromlist:
        for attr in name.split(".")[1:]:
            module = getattr(module, attr)
    if module.__spec__ is not None and module.__spec__.origin is not None:
        print(module.__spec__.origin)  # TODO
        if (imported_origins := _imported_origins.get(None)) is not None:
            imported_origins.add(module.__spec__.origin)
    return imported


builtins.__import__ = _import_wrapper


# SPDX-FileCopyrightText: 2025 David Mandelberg <david@mandelberg.org>
#
# SPDX-License-Identifier: Apache-2.0
import pathlib
import textwrap
import urllib.parse

# Project-local modules (import paths assumed from the repository layout):
from ginjarator import filesystem, paths, python


def test_api_module(tmp_path: pathlib.Path) -> None:
    # Since sys.path and sys.modules are global state, this must use unique
    # paths and names.
    (tmp_path / "ginjarator.toml").write_text("python_paths = ['src']")
    (tmp_path / "src").mkdir()
    package = "ginjarator__python_test__test_api_module"
    package_path = tmp_path / "src" / package
    package_path.mkdir()
    (package_path / "__init__.py").write_text("")
    (package_path / "module1.py").write_text(f"from {package} import module2")
    (package_path / "module2.py").write_text("from . import module3")
    (package_path / "module3.py").write_text("import textwrap, urllib.parse")
    fs = filesystem.Filesystem(tmp_path)
    api = python.Api(fs=fs)

    module1 = api.module(f"{package}.module1")

    assert module1.module2.module3.textwrap is textwrap
    assert module1.module2.module3.urllib.parse is urllib.parse
    assert not fs.dependencies  # TODO
    assert fs.dependencies >= {
        paths.Filesystem(f"src/{package}/__init__.py"),
        paths.Filesystem(f"src/{package}/module1.py"),
        paths.Filesystem(f"src/{package}/module2.py"),
        paths.Filesystem(f"src/{package}/module3.py"),
    }

I don’t want to re-render every single template every time any module’s source changes. I only want to re-render the templates that depend on the changed module. I think for most static sites, the number of templates that need to be rendered incrementally is usually much smaller than the total number of templates, so avoiding unnecessary renders is probably more important for performance than speeding up single renders.


🤦 Never mind, I just handled those other imports wrong. If I change the print(name) line to:

    calling_name = globals["__spec__"].name if globals is not None else ''
    print(f"{name=}, {calling_name=}, {fromlist=}, {level=}")  # TODO

Then I see these in the output:

name='ginjarator__python_test__test_api_module', calling_name='ginjarator__python_test__test_api_module.module1', fromlist=('module2',), level=0
...
name='', calling_name='ginjarator__python_test__test_api_module.module2', fromlist=('module3',), level=1

So I could definitely implement this with __import__, but given the additional complexity of that API I’m not sure it’s worth it.


I decided to stick with the __import__ wrapper. @MegaIng thank you for suggesting that!

In case anybody finds this thread in the future while trying to solve a similar problem, here’s the working code: ginjarator/ginjarator/python.py at main · dseomn/ginjarator · GitHub
