Async imports to reduce startup times

Continuing the discussion from Make unicodedata.normalize a str method:

We’ve had various ideas for implicitly and explicitly lazy imports, but this thread inspired one I hadn’t seen considered before:

async import unicodedata

This would put a future object in sys.modules and start a background thread to do the actual import (the actual API would provide blocking sync methods in addition to the async methods). It would not bind any names in the current scope, it would just start the import running.

(Edit: nobody liked the version of the idea that involved implicitly launching background imports, so the idea later changed to instead just registering these deferred import requests with the import system, and leaving it up to application code to decide exactly when and how the full imports should be executed)

Waiting for the import later would then be a matter of doing either:

import unicodedata

Or

await import unicodedata

Either spelling would wait for the import to finish, either synchronously or asynchronously.

3 Likes

Interesting. I’d be curious to know whether the cost of importing a module like this is primarily:

  1. Searching sys.path etc. to find the module?
  2. Reading the file from disk (or other source)?
  3. Executing the module (if there’s no .pyc), or unmarshalling the .pyc?

If the main cost is in 1 or 2, then spinning off a thread would make good sense (in theory, async I/O would be even better, but that would make a LOT of assumptions about the nature of the app); but if the cost is in 3, this might not actually save much - unless this depends on free threading, in which case that’d be important to mention.

What would happen if you did this, exactly, with current semantics?

import threading

def fetch():
    import unicodedata

threading.Thread(target=fetch).start()

I’m not sure how to lag that out horrifically to try to make it slow enough to poke around with, but I’m curious what the main thread sees in sys.modules["unicodedata"] while another thread is doing the fetch.
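(One way to lag it out for experimenting might be a throwaway module that just sleeps in its body - slowmod.py below is invented purely for this purpose:)

# slowmod.py - stand-in for a module that is slow to import
import time
time.sleep(2)

# main.py
import sys
import threading
import time

def fetch():
    import slowmod

threading.Thread(target=fetch).start()
time.sleep(0.5)  # let the background import get underway
print(sys.modules.get("slowmod"))  # what does the main thread see here?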

8 Likes

Do we really want more implicit thread use and thread joins happening somewhere internally? I presume that the moment someone actually uses the import, it would block on completing the background work, so this seems like a worse version of generalized lazy imports. There are existing solutions I’ve seen to make imports lazy/lazier already, but the approach to bring it into the standard library (PEP 690) was rejected.

6 Likes

(Note: as far as stdlib examples go, re is a better one than unicodedata - it routinely weighs in at 3.5+ ms import time even with a warm disk cache, vs the < 0.2 ms of unicodedata)

The thing I like about this idea is that it avoids some of the major reasons that PEP 690 was rejected:

But a problem we deem significant when adding lazy imports as a language feature is that it becomes a split in the community over how imports work. A need to test code both ways in both traditional and lazy import setups arises. It creates a divergence between projects who expect and rely upon import time code execution and those who forbid it. It also introduces the possibility of unexpected import related exceptions occurring in code at the time of first use virtually anywhere. Such exceptions could bubble up from transitive dependency first use in unanticipated places.

With explicit syntax to denote both “start importing this in the background, but don’t bind it locally yet” (async import ...) and “ensure this module is fully imported” (either a synchronous wait with import ... or an asynchronous wait with await import ...), and with regular import statements still meaning “don’t continue processing in this thread until this module has been imported”, there’s no fundamental change to the way imports work based on a hidden global mode switch.

Instead, there’s just some more convenient syntax for spinning heavy imports out to a thread pool to finish loading them before the main thread needs them (while letting the main thread continue on with other work in the meantime).

It’s certainly an idea with genuine flaws (in particular, anything you do import in the main thread may still end up blocking waiting for something that a background thread is importing, and you’re still exposed to all the deadlock issues that affect threaded imports), it just doesn’t have the same flaws as previous proposals (and the flaws the idea does have are intrinsic to allowing imports from subthreads in the first place).

While you can theoretically set up that experiment via a test module that waits for a locking event to be set, it’s easier to just look at what importlib._bootstrap actually does to handle threaded imports.

  1. It has a _module_locks weak-value dictionary to ensure only one thread is trying to import a given module at a time. Any other threads that try to do the same import will block on the module lock waiting for the importing thread to finish (that is, import ... already implements the blocking semantics that this idea would need as part of supporting threaded imports at all)
  2. It has a _blocking_on weak-value dictionary to keep track of which module locks a given thread ID is waiting on
  3. It has assorted machinery to help manage deadlock detection when different threads import the same modules in different orders, as well as to avoid deadlocks when the same thread tries to reimport modules that it is already importing
  4. The module itself is stored in sys.modules while it is being loaded (to give circular imports their best chance of working). This “initialization in progress” state can be detected by checking that module.__spec__._initializing exists and is true (although that’s technically just an implementation detail of the current algorithm - those specs could just as easily be kept in yet another weak-value dictionary to indicate initialization was in progress). A minimal detection sketch follows this list.
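To make point 4 concrete, here’s a minimal way to check for that “initialization in progress” state. It leans on the _initializing implementation detail noted above, so it’s a diagnostic sketch rather than a supported API:

import sys

def import_in_progress(name):
    """Best-effort check: has the import of 'name' started but not yet finished?"""
    module = sys.modules.get(name)
    if module is None:
        return False  # no import has started (or it failed and was removed)
    spec = getattr(module, "__spec__", None)
    return bool(getattr(spec, "_initializing", False))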

So for this idea, none of that would change (despite what I wrote earlier about storing something new in sys.modules - I had forgotten about this aspect of threaded imports when I wrote that).

The only new bits would be:

  • async import ... syntactic sugar: if there is an async event loop active in the current thread, runs the import in the background with asyncio.to_thread(threaded_import). If there is no event loop active, spawns a new thread directly (as shown in @Rosuav’s post above).
  • await import ... syntactic sugar: this would just be a thin wrapper around asyncio.to_thread(threaded_import), letting the synchronous module locks take care of the “wait for the threaded import to finish” bit (a rough desugaring of both forms is sketched below)

For asynchronous from imports, there could also be a from await ... import ... form.
await import ... with multiple names would be equivalent to an asyncio.gather() call on the individual modules.
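To make the two bullet points above concrete, here’s a rough desugaring of the proposed statements in terms of today’s asyncio and threading APIs (threaded_import is just importlib.import_module run off the main thread; the syntax itself is of course hypothetical):

import asyncio
import importlib
import threading

def threaded_import(name):
    # import_module already blocks on the per-module import lock if
    # another thread is mid-import, so running it off-thread is safe
    return importlib.import_module(name)

def async_import(name):
    """Rough desugaring of the proposed 'async import name'."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        loop = None
    if loop is not None:
        # Event loop active: dispatch the import to the default thread pool
        return loop.create_task(asyncio.to_thread(threaded_import, name))
    # No event loop: spawn a thread directly, as in the earlier example
    thread = threading.Thread(target=threaded_import, args=(name,))
    thread.start()
    return thread

async def await_import(*names):
    """Rough desugaring of 'await import a, b, ...'."""
    return await asyncio.gather(
        *(asyncio.to_thread(threaded_import, name) for name in names))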

Edit: there may be one global mode switch worth considering: an importlib.defer_async_imports() and importlib.start_async_imports() pair. The use case would be things like --help and --version commands in CLI applications, where there’s no point in loading all the heavy implementation machinery until after you know you’re actually going to be doing some real work on that run.

1 Like

I think there’s quite a bit of time that can be saved in imports via engineering rather than implicit asynchronicity here. I know reading individual .pyc files can be sped up ~15% (see 1 below). I suspect there are also quite a few savings in doing fewer round trips to the OS during import (see 2 below). From my perspective it’s a lot better to make the common case faster/simpler than to add more ways of doing things.

Packages with lots of small files on disk, each read independently → lots of system calls → lots of round trips, which take time (the interpreter is faster than all the round trips in the cases I’ve investigated). I think that’s a big part of the win the frozen modules (baked into the binary/.so) get.

With async import I think you’d have a lot of hard threading edge cases, where async import A followed by import B could both load some common module C, potentially causing contention.

  1. Reading text/bytes currently passes through BufferedIO, which has some overhead that can be saved. The open_code hook (File Objects — Python 3.13.0 documentation) unfortunately means we can’t change from open("rb") to open("rb", buffering=0), which gives a ~15% performance improvement in reading bytes (.pyc files), as measured by changing pathlib.Path.read_bytes() (GH-120754: Disable buffering in Path.read_bytes by cmaloney · Pull Request #122111 · python/cpython · GitHub). The buffering layer is the source of the lseek() system calls currently seen when reading .pyc files. This is really intricate to change, but would make a lot of things just a little bit faster.
  2. Investigate different cache strategies that eliminate the need to repeatedly scan a directory + stat multiple files + open + read. Just caching stat calls didn’t make much of a performance difference in my experiments (importlib nicely has a single function designed for implementing exactly that), but if each import (including the recursive imports it loads) emitted a “cache index” file, then a single stat + read could reliably go from source file → the up-to-date .pyc file(s) to read, which should save a lot of work. From my measurements, what adds a lot of time is the round trips (python → OS → python), not the actual I/O or computation. This was investigated in PEP 3147, which added __pycache__, but was deferred then. Now that disks/OSes can move the bytes a lot faster, I think it’s likely to be more of a win for the complexity.
  3. Being able to open+read multiple .py/.pyc or similar files in parallel (possibly optimistically, based on past runs?). Getting there may require adding to or modifying the open_code hook, which has been stable for a long time, so it’s not a path I’ve looked at as much. I’ve definitely seen some very good speedups from loading lots of small text files in a directory (ex. yaml/json) in parallel. I’m hoping to make the io module able to support this for files generally, but am not quite there yet (another likely prerequisite).
2. Underlying system calls on Linux for loading just the `re` module:
newfstatat(AT_FDCWD, "/home/user/projects/python/cpython/Lib/re", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
newfstatat(AT_FDCWD, "/home/user/projects/python/cpython/Lib/re", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
newfstatat(AT_FDCWD, "/home/user/projects/python/cpython/Lib/re", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
openat(AT_FDCWD, "/home/user/projects/python/cpython/Lib/re", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
getdents64(3, 0x623f2e688da0 /* 8 entries */, 32768) = 248
getdents64(3, 0x623f2e688da0 /* 0 entries */, 32768) = 0
close(3)                                = 0
newfstatat(AT_FDCWD, "/home/user/projects/python/cpython/Lib/re/_compiler.py", {st_mode=S_IFREG|0644, st_size=26545, ...}, 0) = 0
newfstatat(AT_FDCWD, "/home/user/projects/python/cpython/Lib/re/_compiler.py", {st_mode=S_IFREG|0644, st_size=26545, ...}, 0) = 0
openat(AT_FDCWD, "/home/user/projects/python/cpython/Lib/re/__pycache__/_compiler.cpython-314.pyc", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=27140, ...}) = 0
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "\31\16\r\n\0\0\0\0^,\207f\261g\0\0\343\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0"..., 27141) = 27140
read(3, "", 1)                          = 0
close(3)                                = 0
newfstatat(AT_FDCWD, "/home/user/projects/python/cpython/Lib/re", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
newfstatat(AT_FDCWD, "/home/user/projects/python/cpython/Lib/re/_parser.py", {st_mode=S_IFREG|0644, st_size=40294, ...}, 0) = 0
newfstatat(AT_FDCWD, "/home/user/projects/python/cpython/Lib/re/_parser.py", {st_mode=S_IFREG|0644, st_size=40294, ...}, 0) = 0
openat(AT_FDCWD, "/home/user/projects/python/cpython/Lib/re/__pycache__/_parser.cpython-314.pyc", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=42624, ...}) = 0
lseek(3, 0, SEEK_CUR)                   = 0
brk(0x623f2e6ca000)                     = 0x623f2e6ca000
read(3, "\31\16\r\n\0\0\0\0\1u\301ff\235\0\0\343\0\0\0\0\0\0\0\0\0\0\0\0\25\0\0"..., 42625) = 42624
read(3, "", 1)                          = 0
close(3)                                = 0

Feels like it should be able to get down to:

  1. Read in stdlib cachefile (created at build/distribution time by compileall or on first run of the implicit imports)
  2. Lookup re module inside of it
  3. Open and read that specific set of .pyc files if not already in the module cache (a hypothetical sketch of this lookup follows).
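A purely hypothetical sketch of steps 2 and 3, assuming an invented JSON index format that maps a top-level module to the module names and up-to-date .pyc paths it pulls in:

import json
import sys
from pathlib import Path

def read_cached_pycs(index_path, mod_name):
    """Return raw .pyc bytes for mod_name and its recursive imports.

    The index format (and this function) are invented for illustration;
    the real design work is in keeping the index up to date.
    """
    index = json.loads(Path(index_path).read_text())  # one stat + read
    entries = index.get(mod_name, {})                 # e.g. index["re"]
    return {name: Path(pyc_path).read_bytes()         # only what isn't loaded yet
            for name, pyc_path in entries.items()
            if name not in sys.modules}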
6 Likes

My gut feeling is that, in these situations, you would do best to import B followed by async import A - that is, all your async imports should come AFTER your synchronous ones. If you’re going to wait for B anyway, you may as well wait for it first, and only then fire off the import of A.

1 Like

As for the re module, the main contributors are enum, functools and collections (imported indirectly, via functools):

import time:       450 |        450 |     types
import time:      1961 |       2411 |   enum
import time:        87 |         87 |     _sre
import time:       352 |        352 |       re._constants
import time:       529 |        881 |     re._parser
import time:       197 |        197 |     re._casefix
import time:       438 |       1600 |   re._compiler
import time:      1219 |       1219 |       _collections_abc
import time:       152 |        152 |       itertools
import time:       215 |        215 |       keyword
import time:        94 |         94 |         _operator
import time:       434 |        528 |       operator
import time:       245 |        245 |       reprlib
import time:       573 |        573 |       _collections
import time:      1277 |       4204 |     collections
import time:        74 |         74 |     _functools
import time:      1406 |       5683 |   functools
import time:       229 |        229 |   copyreg
import time:       731 |      10652 | re

enum is needed for RegexFlag. Creating an enum class is costly too; it involves a lot of introspection. Initially the re flags were implemented without the enum module. Do we need a limited, faster version of enum?

functools is needed for lru_cache() which is used to cache the compiled templates in re.sub(). collections is imported in functools only for namedtuple which is used for the result of lru_cache().cache_info(). It is rarely needed, but I do not see how to make this lazy.

It seems that in all cases most of the time is spent executing Python code, in particular creating enum and namedtuple classes. A thread will not help here. I’m also against creating implicit threads in the stdlib.

10 Likes

I can see a need to be able to force the async imports to be done synchronously.

The use case I’m thinking of is where some library writers decide that async import is cool.

But for my app I do not care about startup time; what I care about is runtime latencies, and having async imports firing at uncontrolled points in time could be an issue.

1 Like

Whatever happened to the lazy import discussions? It seemed like something that everyone wanted.

Non-stdlib examples provide better background for the potential benefit of deferring import work.

PyTorch with a cold disk cache (on an NVMe SSD) can nearly reach the 2 second mark:

import time: self [us] | cumulative | imported package
...
import time:     33365 |    1962108 | torch

With a warm disk cache (same SSD), the load time drops back to reliably being under a second:

import time: self [us] | cumulative | imported package
...
import time:     11415 |     686148 | torch

What I’m starting to wonder is whether, rather than starting a background import implicitly, async import ... could instead be shorthand for a new importlib.machinery.request_background_import operation that registers the background import as requested and performs the importlib.util.find_spec step, but doesn’t actually run the import itself.

Essentially the module would be telling the interpreter “I will need this at runtime, but I don’t need it at import time”.

It would then be up to the application to decide “OK, all the background imports should actually happen now”, and deal with any import errors that come up. Whether those background imports happen synchronously in the main thread, or get dispatched to a thread pool with concurrent.futures.ThreadPoolExecutor, would be up to the main application.

As the simplest possible thread-safe dispatch-friendly API, this might look like:

for mod_name in importlib.util.get_requested_background_imports():
    importlib.import_module(mod_name)

That could be wrapped up in an importlib.util.run_background_imports() convenience function for applications that just want to delay the async imports without attempting to parallelize them (parallelization gets potentially interesting on free-threaded Python builds, but for use cases like cli-app --version, delaying the imports is the important bit).
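For applications that did want to parallelize, the dispatch variant might look like this (get_requested_background_imports() is the API proposed above, not something importlib provides today; the per-module import locks are what make the overlapping calls safe):

import importlib
import importlib.util
from concurrent.futures import ThreadPoolExecutor

def run_background_imports_parallel(max_workers=4):
    """Hypothetical parallel counterpart to run_background_imports()."""
    # Proposed API, not in importlib today:
    mod_names = list(importlib.util.get_requested_background_imports())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Modules land in sys.modules as each worker finishes; any
        # ImportError propagates out of the map() iteration here
        for _module in pool.map(importlib.import_module, mod_names):
            pass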

If nothing actually executes the requested background imports, they’ll happen inline wherever function level import ... statements appear (no need for any new await import ... syntax, since the name binding imports would still always be synchronous in this approach). Since running the import again would just retrieve the module from sys.modules, these wouldn’t even need to pop the module name from the set of background imports (although they could).

PEP 690 getting rejected killed the idea of implicitly lazy imports, where a global flag made all imports lazy, and you didn’t get import errors until the first module attribute access attempt.

So folks that really need lazy imports continue to use the existing solutions for setting them up via sys.meta_path manipulation, and we haven’t come across any compelling ideas for improved language level support for them.

I think a lot of the enthusiasm in getting anything accepted into core python or the stdlib was lost with the PEP 690 rejection. There were definitely issues with the PEP (I really didn’t like the required -L argument, as it would have effectively made them unusable for CLI tools where you can see significant performance improvement from lazy imports[1]) but it’s a shame that the result so far is that we have nothing.


I ended up making a small lazyimporter library for my own use, initially to avoid some of the larger imports[2] that other lazy import modules eagerly performed. It roughly involves registering imports on a LazyImporter object; they are then performed and cached when their asname is accessed via __getattr__.
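For anyone unfamiliar with the pattern, a stripped-down sketch of the registration-plus-__getattr__ idea (not the actual lazyimporter API) looks something like:

import importlib

class LazyImporter:
    """Minimal sketch: asname -> module name registrations that are
    imported and cached on first attribute access."""
    def __init__(self, imports):
        self._imports = dict(imports)

    def __getattr__(self, asname):
        try:
            mod_name = self._imports[asname]
        except KeyError:
            raise AttributeError(asname) from None
        module = importlib.import_module(mod_name)
        # Cache on the instance so __getattr__ isn't triggered again
        setattr(self, asname, module)
        return module

laz = LazyImporter({"np": "numpy"})
# laz.np runs 'import numpy' on first access, then reuses the cached module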

I’m not sure it’s the best design, which is one reason I haven’t really told anyone about it before. You unfortunately lose autocompletion and the ability to use these imports for type hints due to the dynamic nature. However, it does make it possible to do try/except and conditional lazy imports[3] and it’s obvious when you’re doing something which might trigger an import as you’re accessing it from the LazyImporter object you have constructed[4].


Which does lead me to raise the question of whether (and how) something like this async importer could handle a conditional import or a try/except.


  1. Look at how many imports pip does with python -X importtime -m pip --version; pip is not unique in this. ↩︎

  2. Actually to avoid almost any imports. ↩︎

  3. I use this for lazily importing tomllib or tomli if it is unavailable, for instance. If there’s no toml file to parse, I don’t need the overhead of loading the toml parser in the first place. ↩︎

  4. There is both a function and an environment variable that can be used to force lazy importers to perform their imports eagerly. ↩︎

1 Like

I think the implicit thread creation idea that nobody liked would have struggled with try/except, but conditional imports would still have been OK.

In the variant where async import name is just syntactic sugar for a new registration function that runs find_spec immediately, but puts off actually doing the module load until it is told to run them all, both would be fine.

Same here, and for an app where startup time does matter (such as a frequently-run large CLI tool) I would make the main program a persistent background process and make the frontend CLI script a thin client that simply passes sys.argv to the background process via an IPC channel and dumps its response.

1 Like

I think an improvement in deferred imports of some kind would be better both for CLI tools where startup time matters and for applications where runtime latencies are more important, as long as you could force the imports to happen immediately at the application level.

One of the most common methods of deferring imports at the moment is to hide the import away inside the function or branch that is going to use it, with no way to force the import other than knowing about it and pre-emptively importing it yourself. Having a standard way to force all such imports to happen eagerly would surely be better for the case where latencies are more important, if it could replace this idiom.

5 Likes

It’s possible to use a context manager that manipulates sys.meta_path to get explicit lazy imports that work well with IDEs / static type checkers. One thing you need to be careful about when implementing this is that find_spec isn’t truly lazy: it will import parent modules (you can reimplement some of find_spec in user code to work around this).
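A sketch of that approach, using the stdlib’s importlib.util.LazyLoader to do the deferral (the careful find_spec reimplementation mentioned above is omitted, so parent packages still get imported eagerly here):

import sys
import importlib.abc
import importlib.util
from contextlib import contextmanager

class _LazyFinder(importlib.abc.MetaPathFinder):
    """Wrap whatever spec the other meta path finders produce in a LazyLoader."""
    def find_spec(self, name, path=None, target=None):
        for finder in sys.meta_path:
            if finder is self:
                continue
            spec = finder.find_spec(name, path, target)
            if spec is not None and spec.loader is not None:
                spec.loader = importlib.util.LazyLoader(spec.loader)
                return spec
        return None

@contextmanager
def lazy_imports():
    """Imports inside the block bind a module object immediately, but its
    body only executes on first attribute access."""
    finder = _LazyFinder()
    sys.meta_path.insert(0, finder)
    try:
        yield
    finally:
        sys.meta_path.remove(finder)

# with lazy_imports():
#     import json  # bound now, executed on first json.<attr> access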

There’s sort of 2 elements to this.

One is that I like the lazy import being visible both when it’s setup and when it may be triggered (as any attribute access on a LazyImporter object may trigger an import).

The other is that, by using a delayed call to __import__ instead of working with the import machinery directly, if I go away and work on unrelated projects for months and then return to fix an issue or add some feature I need, I’m only working with general Python magic methods and don’t have to remember/relearn any of the inner details of what happens when you call import.

IDE completion and type checking are more of a nice-to-have for me, compared with keeping the implementation simple and as far away from importlib internals as possible[1]. (I’m also not sure how try/except would work in that context.)

Edit: Added emphasis to ‘for me’, I recognise that these may be more important to others.


  1. I’ve worked with import hooks before, for a dataclasses-like implementation in the AST before compilation. It ended up being a bit of a headache and had its own issues, but it did at least surface an importlib bug over how a function call in one method could possibly work (the call wouldn’t have worked, but the branch was also unreachable). ↩︎

1 Like

I wonder if the interpreter could notice a grouping of imports on multiple lines and internally async import them, then join before moving to the next lines of code.

Maybe a context manager that changes import behavior to ^ for a given block could be helpful.

All that said, I honestly think just about every idea like this to speed up imports (including this one and most others I’ve read) has been too complicated to take off.

Complexity (especially when debugging failed imports) was certainly a key theme in the PEP 690 rejection.

One of the things I’m aiming for in this thread is to avoid changing the consumption side complexity: deferred imports are resolved the same way they are now, which is via function level import statements. Even the initial version of the await import ... idea just combined a regular import with asyncio.to_thread.

I still think there’s a potentially useful meaning that can be given to async import ..., which would be to allow the interpreter to compile an import time map of runtime import dependencies, without having to fully resolve those dependencies at import time. There wouldn’t be any effect on either sys.modules or the importing namespace at the time the deferred import notification is submitted.

This is different from the way existing meta path based lazy importers work, as those need to come up with something to stick in sys.modules and the importing namespace, which is where all the lazy attribute resolution magic that got PEP 690 rejected comes into play.

The potential pieces for a useful enhancement that I’m currently seeing would be to add the machinery described below, so an application could resolve deferred imports at the time of its choosing by calling importlib.util.resolve_deferred_imports(). Alternatively, it could resolve deferred imports for a subset of eagerly imported modules by giving the name of the starting point, or do its own thing (such as resolving deferred imports in a thread pool) by iterating over importlib.util.iter_deferred_imports() directly.

I’m still not sure this would be worth it, but it does offer a potential approach that avoids the problems that got PEP 690 rejected, while still allowing modules to declare the module dependencies that they don’t need at import time, but will need at runtime.

importlib.util.import_module_from_spec(spec)

This would be similar to importlib.import_module, but would accept a full ModuleSpec instance instead of just the module name. It would both check and update sys.modules (unlike importlib.util.module_from_spec), while fully respecting the module import locking thread synchronisation machinery (unlike the example code in the importlib docs). It would raise a new ModuleSpecConflictError subclass of ImportError if the module is already present in sys.modules with conflicting __spec__ details.
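A rough sketch of the intended semantics (ignoring the per-module import locks a real implementation would have to take, and using origin comparison as a stand-in for whatever “conflicting __spec__ details” would precisely mean):

import sys
import importlib.util

class ModuleSpecConflictError(ImportError):
    """Proposed subclass for sys.modules vs ModuleSpec mismatches."""

def import_module_from_spec(spec):
    module = sys.modules.get(spec.name)
    if module is not None:
        existing = getattr(module, "__spec__", None)
        if existing is not None and existing.origin != spec.origin:
            raise ModuleSpecConflictError(
                f"{spec.name!r} already imported from {existing.origin!r}")
        return module
    module = importlib.util.module_from_spec(spec)
    sys.modules[spec.name] = module
    try:
        spec.loader.exec_module(module)
    except BaseException:
        sys.modules.pop(spec.name, None)  # don't leave a half-loaded module
        raise
    return module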

importlib.util.defer_runtime_import(name, package=None, importing_name=None)

import sys
import importlib.util
from importlib.machinery import ModuleSpec

# Mapping from deferred runtime imports to their specs
_deferred_imports: dict[str, ModuleSpec] = {}
# Mapping from deferred runtime imports to the modules requesting them
_deferred_by_target: dict[str, set[str]] = {}
# Mapping from module names to their deferred runtime imports
_deferred_by_importer: dict[str, set[str]] = {}

def defer_runtime_import(name, package=None, importing_name=None):
    """Register an expected future runtime import"""
    module = sys.modules.get(name)
    if module is not None:
        # Already imported (or is being imported)
        return module.__spec__
    mod_spec = _deferred_imports.get(name)
    if mod_spec is None:
        mod_spec = importlib.util.find_spec(name, package)
        if mod_spec is None:
            # Eagerly report missing dependencies
            raise ImportError(...)
        _deferred_imports[name] = mod_spec
    if importing_name is not None:
        _deferred_by_target.setdefault(name, set()).add(importing_name)
        _deferred_by_importer.setdefault(importing_name, set()).add(name)
    return mod_spec

importlib.util.iter_deferred_imports(importing_name=None)

def iter_deferred_imports(importing_name=None):
    """Iterate over the details of registered deferred runtime imports"""
    if importing_name is None:
        mod_names = tuple(_deferred_imports)
    else:
        mod_names = tuple(_deferred_by_importer.get(importing_name, ()))
    for mod_name in mod_names:
        mod_spec = _deferred_imports[mod_name]
        importing_names = _deferred_by_target.get(mod_name, ())
        yield mod_name, mod_spec, tuple(importing_names)

importlib.util.resolve_deferred_import(name)

def resolve_deferred_import(name):
    """Resolve a registered deferred import"""
    mod_spec = _deferred_imports.get(name, None)
    if mod_spec is None:
        # Let the caller decide if this is an error or not
        return None
    module = import_module_from_spec(mod_spec)
    del _deferred_imports[name]
    importing_names = _deferred_by_target.pop(name, ())
    for importing_name in importing_names:
        targets = _deferred_by_importer.get(importing_name, None)
        if targets is None:
            continue
        targets.discard(name)
    return module

importlib.util.resolve_deferred_imports(importing_name=None)

def resolve_deferred_imports(importing_name=None):
    deferred_imports = iter_deferred_imports(importing_name)
    for mod_name, __, __ in deferred_imports:
        resolve_deferred_import(mod_name)
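Usage might then look something like this (all of the importlib.util functions involved are the proposed additions above, not existing APIs, and yaml is just a stand-in for a heavy runtime-only dependency):

# some_library.py: declare a runtime-only dependency at import time
import importlib.util
importlib.util.defer_runtime_import("yaml", importing_name=__name__)

def load_config(path):
    import yaml  # already in sys.modules if the app resolved deferred imports
    with open(path) as f:
        return yaml.safe_load(f)

# app.py: once startup-critical work (--help/--version etc.) is out of the
# way, resolve everything the imported libraries registered
import importlib.util
importlib.util.resolve_deferred_imports()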

What if, instead of threads and awkward syntax, we add an affordance for the import mechanisms to perform I/O concurrently but retain serial evaluation of the modules and population of sys.modules etc.

The compiler could identify blocks of consecutive import statements, which are common, and hand all the module names to the import system in bulk.

So

import ham.spam
from .eggs import spam as es
es.stuff()
import lobster_thermidor

would evaluate like

builtins.__preimport__(__package__, ('ham.spam', '.eggs'))
import ham.spam
from .eggs import spam as es
es.stuff()
builtins.__preimport__(__package__, ('lobster_thermidor',))
import lobster_thermidor

Then we could add APIs in the import system to perform I/O concurrently, maybe with io_uring on Linux, or a thread pool. But it’s only I/O to traverse sys.path and read bytes into caches, no evaluation, so it’s more widely applicable and there’s much less scope for it to break everything.
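A sketch of what such a hook could do under the hood, using a thread pool to pull file bytes into the OS cache (everything here is hypothetical, and find_spec still imports parent packages eagerly, so a real version would want the lower-level path search):

import importlib.util
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

_io_pool = ThreadPoolExecutor(max_workers=4)
_prefetched = {}

def __preimport__(package, names):
    """Resolve specs and prefetch file bytes; no module execution."""
    for name in names:
        if name in _prefetched:
            continue
        spec = importlib.util.find_spec(name, package)
        if spec is not None and spec.has_location and spec.origin:
            # Reading the bytes (even if discarded) warms the OS page
            # cache before the real import statement needs them
            _prefetched[name] = _io_pool.submit(Path(spec.origin).read_bytes)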

Less, but not none. If you import something that changes sys.path, this won’t work. I’m not sure if there’s a way around that (maybe start preemptively scanning for the next module(s) needed, but being prepared to discard that if something changes), but it is an issue nonetheless.