Async imports to reduce startup times

Yes, this would be for warming caches inside import hooks, not actually executing anything.

For big modules like torch, it really is loading the module itself (and its dependencies) that takes the time, not finding them on disk (we see this in the stark -X importtime differences between cold and warm operating system disk caches). importlib already caches things pretty heavily to reduce system calls (hence importlib.invalidate_caches() existing), and while there may be opportunities for further improvement there (see @cmaloney’s comments earlier in the thread), those should mainly just be import system performance improvements that are transparent to end users, rather than being something that changes the way code is written.
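For reference, that kind of comparison can be reproduced with the -X importtime flag, which writes per-import self and cumulative times to stderr (torch is just the example heavy module here):

python -X importtime -c "import torch" 2> importtime.log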

By contrast, for CLI utilities with multiple subcommands, the pay-off they get from lazy imports is that they then only have to pay for what they use, which means information query commands like cli-app --version don’t pay anything except the cost of importing the module that contains the runtime __version__ attribute.
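A minimal sketch of that structure (package and module names are hypothetical, and the version query is shown as a subcommand rather than a --version flag):

import argparse

def cmd_version(args):
    # Only the tiny module holding the version metadata is imported
    from myapp import __version__  # "myapp" is a hypothetical package
    print(__version__)

def cmd_train(args):
    import torch  # the heavy dependency is paid for only by this subcommand
    ...

def main():
    parser = argparse.ArgumentParser(prog="cli-app")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("version").set_defaults(func=cmd_version)
    sub.add_parser("train").set_defaults(func=cmd_train)
    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()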

The downside of current lazy import techniques is that any attribute access on a module may trigger that initial runtime latency hit of resolving the lazy import (and may throw arbitrary exceptions resulting from the module execution).
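The stdlib already ships one such mechanism: the recipe below is adapted from the importlib docs for importlib.util.LazyLoader, and shows how the module body only runs (and any exception it raises only surfaces) at first attribute access.

import importlib.util
import sys

def lazy_import(name):
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # sets up the deferred load; nothing executes yet
    return module

json = lazy_import("json")  # no module code has run at this point
json.dumps({})              # first attribute access executes the module body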

The question that prompted this particular thread is whether there might be value in a middle ground that lets modules declare at import time “Hey, I’m probably going to need this dependency once anyone actually starts using the functions and classes in this module”, and then have a way to let applications decide when to actually run all those deferred imports (including the default behaviour of simply letting them run when the function level code first executes import mod_name or from mod_name import ...). (I’ll edit the initial post to explain that nobody liked the idea of implicitly launching the imports in the background)


Sorry if this is getting a little sidetracked, but this is actually one of the specific things I wanted to address with my lazy importer: coupling the potential import with use of the object/module being imported. The import will just happen the first time the attribute is accessed.

The idea is that if you have a function where some - but not all - branches may end up using a module with a large import time, the import only occurs if those specific branches are hit, without needing to put the import statement inside each branch in order to do so.

Say you have a function that’s something like this:

import big_module

def function_that_might_do_expensive_stuff():
    if condition:
        return "didn't use big_module"
    elif other_condition:
        return big_module.expensive_function(...)
    elif yet_other_condition:
        if sub_condition:
            return big_module.other_expensive_function(...)
        else:
            return "didn't use big_module"
    else:
        return big_module.expensive_function(...)

With the way things are, your ‘best’ case would need to look like this, only doing the work of importing the large module when it’s actually needed:

def function_that_might_do_expensive_stuff():
    if condition:
        return "didn't use big_module"
    elif other_condition:
        import big_module
        return big_module.expensive_function(...)
    elif yet_other_condition:
        if sub_condition:
            import big_module
            return big_module.other_expensive_function(...)
        else:
            return "didn't use big_module"
    else:
        import big_module
        return big_module.expensive_function(...)

But to avoid the repetitive nature of this, I think people are likely at best to put the import at the top of the function, or even at module level, and so end up doing the unnecessary import work even when the module isn’t used.
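A minimal sketch of that attribute-access coupling (illustrative only, not the actual ducktools implementation):

import importlib

class LazyModuleProxy:
    # Minimal sketch: defer the real import until first attribute access.
    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only reached for attributes not on the proxy itself, i.e. the
        # real module's contents; the import happens exactly once.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)

big_module = LazyModuleProxy("big_module")  # no import cost paid yet

With something like this, the branchy function above keeps a single module-level assignment while only paying the import cost on the branches that actually touch big_module.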


Yes, this can be quite noticeable:

Benchmark 1: DUCKTOOLS_EAGER_IMPORT=False ducktools-env --version
  Time (mean ± σ):      33.6 ms ±   2.5 ms    [User: 27.9 ms, System: 5.8 ms]
  Range (min … max):    29.9 ms …  38.7 ms    20 runs
 
Benchmark 2: DUCKTOOLS_EAGER_IMPORT=True ducktools-env --version
  Time (mean ± σ):     135.2 ms ±   3.1 ms    [User: 120.0 ms, System: 15.0 ms]
  Range (min … max):   131.2 ms … 142.3 ms    20 runs
 
Summary
  'DUCKTOOLS_EAGER_IMPORT=False ducktools-env --version' ran
    4.02 ± 0.31 times faster than 'DUCKTOOLS_EAGER_IMPORT=True ducktools-env --version'

I don’t think that’s a sidetrack, I think it’s at the heart of why there hasn’t been a push for language level changes since PEP 690 was rejected: to get the maximum benefit from lazy imports, you really do need to delay the imports until those modules are actually accessed. Anything less (including the ideas in this thread) means that there will be cases where you end up paying for imports that you never actually use.

It also occurred to me that there’s a straightforward way for folks that want to avoid ad hoc import latencies at arbitrary points during runtime execution (such as @barry-scott above) to force resolution of all lazy imports:

    import sys

    for module in list(sys.modules.values()):
        module.__name__  # Ensure all modules are actually loaded

Edit: To be truly thorough, I think that actually needs to be:

def ensure_all_modules_loaded():
    mod_cache = sys.modules
    already_loaded = set()
    modules_to_load = set(mod_cache)
    while modules_to_load:
        for mod_name in modules_to_load:
            mod_cache[mod_name].__name__  # force resolution of lazy modules
        already_loaded |= modules_to_load
        modules_to_load = set(mod_cache) - already_loaded

Since each new module loaded will presumably register more modules that need loading.


transformers has their own _LazyModule implementation that they use to reduce import times until attributes are accessed. It involves building a dictionary describing the import structure and then dispatching imports on attribute accesses to the _LazyModule class, which is manually inserted into sys.modules.

They have been using this implementation for about 4 years. When they first adopted it, it cut the time to import transformers from 2.3s to 239ms. As the package has grown and continued to add more and more model implementations, I am sure the relative benefit has only increased.

They say themselves that their implementation is inspired by Optuna’s _IntegrationModule. If something like this is being considered, I’m sure it would be instructive to survey existing lazy loading solutions to help motivate a standard library design.


This sounds a lot like a case where it would be really nice to stash the whole interpreter state and just load that up, rather than re-executing the bytecode. It would likely need rules around what can be done in globals (though in at least some of my past experience, a lot of modules already work hard to avoid per-run/per-machine computation in globals).

Basically: how do you say “here is the whole interpreter state that would be built up by the time you reach if __name__ == "__main__"”, and get there without directly loading and executing a potentially large amount of Python source or bytecode (e.g. just mmap()'ed state, one relatively optimized operation, whose pages are only loaded if actually accessed)?


That’s funny; I did exactly this for a proprietary embedded JVM at a company I worked for in 1999. It worked amazingly well, especially because the platform was extremely slow at executing code.

I guess if you can hash everything that’s going to be executed and cache the interpreter state and guarantee that there are no side effects, it could work here.

So far I think we’ve mostly discussed the side of CLI tool or application authors wanting to delay importing expensive modules for their own use.

This appears to be for the other side of the problem. For library authors who want to allow people to do from module import ExpensiveClass where ExpensiveClass is defined in a submodule that is expensive to import, without importing this submodule when import module runs.

This is actually one of the use cases given when module level __getattr__ was added. However, as SPEC 1 indicates and as the transformers example shows, this involves essentially duplicating all of your imports if you want tooling to recognise them (or doing the imports in a separate stub file and using attach_stub).

I think the current solution for this case (without helper modules) would now be something like:

module.py

def __getattr__(attr):
    if attr == "ExpensiveClass":
        from .submodule import ExpensiveClass
        # Cache the attribute so later lookups skip __getattr__ entirely
        globals()[attr] = ExpensiveClass
        return ExpensiveClass
    else:
        raise AttributeError(f"module {__name__!r} has no attribute {attr!r}")

With the scientific python lazy_loader:

module.pyi

from .submodule import ExpensiveClass as ExpensiveClass

module.py

import lazy_loader as lazy

__getattr__, __dir__, __all__ = lazy.attach_stub(__name__, __file__)

With the transformers _LazyModule:

module.py

import sys
from .utils import _LazyModule

_import_structure = {
    "submodule": ["ExpensiveClass"],
}
sys.modules[__name__] = _LazyModule(
    __name__, 
    globals()["__file__"], 
    _import_structure, 
    module_spec=__spec__
)

And with my own tool just because I was trying to handle both this use case and the one for inline imports:

from ducktools.lazyimporter import LazyImporter, FromImport, get_module_funcs

_laz = LazyImporter(
    [FromImport(".submodule", "ExpensiveClass")],
    globs=globals(),
)

__getattr__, __dir__ = get_module_funcs(_laz, module_name=__name__)

In each case (other than the scientific python one with the stub file) these would need a TYPE_CHECKING block or equivalent for tooling to recognise the imports, which means duplicating the declarations and keeping them in sync. It would be nice to be able to improve this and remove the need for the duplication too.
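For reference, the duplicated block in question looks something like this:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Static-only duplicate of the lazy import: type checkers and IDEs
    # resolve ExpensiveClass here, while the runtime lookup goes through
    # __getattr__ (or the lazy importer) instead.
    from .submodule import ExpensiveClass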


I also would like to see some kind of simplification for lazy module authorship.

In my case, I resorted to code gen for a large package, which produces a very long __init__.py with a type checking branch and all of the logic for __getattr__ and __dir__.

I think library authors are willing to make small adjustments for lazier import semantics. But right now (1) there’s no canonically correct solution and (2) several of the solutions which do exist are difficult to apply.
It’s not even necessarily desirable that there be one good way to do it, but I think it should be more obvious how to get started.

I’m not sure how to tackle this issue from the perspective of “make it easier for library authors”. Are any of the solutions on PyPI sufficiently popular and well maintained to be linked from the importlib docs? Or is there a suite of tools which could make this easier if added to importlib? Should there be a guide doc on how to implement lazy imports using __getattr__?


I sent this PR to the lazy_loader repo, which has the potential to greatly simplify the syntax, but it could benefit from some feedback before it is put into the package: Add context manager functionality using hooks by vnmabus · Pull Request #121 · scientific-python/lazy-loader · GitHub. I see many people here interested in this topic, so it would be good if you could add your feedback to the PR to catch potential problems early on.

One of the issues with this approach (I’ll add some specific notes about ways to minimise the downsides on the PR itself in a moment) is that it can’t realistically ever fully cooperate with other meta path finders, at least not without all of them knowing about each other or having a central registry (this has come up with other libraries that use the context manager approach already).

While I’m mostly happy using these solutions, I think that if they become more common, something may need to be adopted into the standard library in some form to prevent the conflicting behavior problems that could otherwise arise.

I think my ideal here would be reviving PEP 690, but with a way to mark a module as safe to lazily import (i.e. it doesn’t rely on the side effects of module execution happening in a specific order). The interpreter would then be free to make decisions about how and when to import those modules, so long as it happens “before use” (loosely termed, because declarative things like an annotation might not be used until a scope in which access actually happens).

The opt-in nature should prevent this from being a disruptive change in terms of import behavior breaking anyone.

By the opt-in coming from the author of a module, rather than those importing the module, and specifically leaving the import behavior in that case up to the interpreter, it also leaves the decisions of “is it safe?” and “how should that be accomplished?” to those who theoretically have the information to know that.


I think I would prefer that lazy imports, instead of being a kind of ‘opt-in’ flag, would just work by being both explicit at the point of use and shallow, essentially the opposite of the importlib.eager_imports context manager proposed by the PEP.


I’ve ended up experimenting with a context manager based lazy importer that temporarily replaces __import__ in order to capture the calls and prevent the import machinery from triggering at all, instead preparing a LazyImporter with the attributes.

The documentation actually seems to imply that if all you want to do is change the meaning of import statements, this would be the way to do it. It also implies that this could be done so as to only affect the current module, but doesn’t seem to explain how[1]?

From the docs (emphasis mine):

If it is acceptable to only alter the behaviour of import statements without affecting other APIs that access the import system, then replacing the builtin __import__() function may be sufficient. This technique may also be employed at the module level to only alter the behaviour of import statements within that module.


  1. I’ve been replacing builtins.__import__ with a function that passes imports through if globals is not the current module namespace, but that’s fragile if something else replaces builtins.__import__ while within the block. ↩︎
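A minimal sketch of that passthrough filter (the target module name is hypothetical, and this version merely records the request rather than returning a placeholder object):

import builtins

_real_import = builtins.__import__
_captured = []  # (name, fromlist, level) requests seen from the target module

def _capturing_import(name, globals=None, locals=None, fromlist=(), level=0):
    # Pass through unless the import originates in the module being watched.
    if globals is not None and globals.get("__name__") == "mypackage.mymodule":
        _captured.append((name, fromlist, level))
    return _real_import(name, globals, locals, fromlist, level)

builtins.__import__ = _capturing_import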

It makes sense, when looking for ways to speed things up, to break down the time spent importing a module into 3 steps:

  1. Loading from disk
  2. Unmarshalling: turning the on-disk format into an executable code object
  3. Executing the code object to create a module object.

Dan suggested an approach to hide 1, which wouldn’t be too hard to implement.
We can reduce 2 considerably, but we haven’t yet, because step 2 doesn’t seem to be that important.
Step 3 is where I suspect most of the time is spent.

Do we have any numbers on the relative time spent in the three parts?
There is no point in putting a lot of effort into something that doesn’t matter.

Another question: Where is the time spent in step 3? How much is in class and function creation?
Class creation is quite slow, and could be improved.
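As a rough way to get per-module numbers, the three steps can be timed separately. A measurement sketch (json is just an example module; exec_module re-reads and re-unmarshals internally and also runs the module’s own imports, so treat the split as approximate):

import importlib.util
import marshal
import sys
import time

name = "json"  # illustrative; ideally a module that isn't imported yet
spec = importlib.util.find_spec(name)
pyc_path = spec.cached  # path to the compiled .pyc (assumes one exists)

t0 = time.perf_counter()
raw = open(pyc_path, "rb").read()  # step 1: load from disk
t1 = time.perf_counter()
code = marshal.loads(raw[16:])     # step 2: unmarshal (16-byte pyc header, PEP 552)
t2 = time.perf_counter()
module = importlib.util.module_from_spec(spec)
sys.modules[name] = module
spec.loader.exec_module(module)    # step 3: execute (redoes 1+2 internally)
t3 = time.perf_counter()

print(f"read {t1 - t0:.6f}s  unmarshal {t2 - t1:.6f}s  exec {t3 - t2:.6f}s")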


Just being curious: as of the current version of Python, are these two steps done in a separate thread (parallel) or in the main thread (blocking)?

Since the main thread must wait for the import to complete I do not see how an import thread will help.

No it won’t help now. But it might help once async import is supported.

Upon second thought, this might even help improve the performance of synchronous imports: we could predictively fetch and parse code objects based on import statements that appear later in the code.

Suppose:

import numpy, torch, cv2

All three modules can be prepared in parallel but executed sequentially (to preserve order of side effects).
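A sketch of what that could look like with today’s public APIs (the prefetch does the disk reads and unmarshalling in worker threads and warms the OS cache, but the sequential __import__ calls still redo steps 1 and 2 internally, since there’s no supported hook for handing them a prefetched code object):

from concurrent.futures import ThreadPoolExecutor
import importlib.util
import marshal

def prefetch(name):
    # Steps 1+2 off the main thread: read and unmarshal the compiled
    # bytecode without executing it.
    spec = importlib.util.find_spec(name)
    with open(spec.cached, "rb") as f:
        return marshal.loads(f.read()[16:])

names = ["numpy", "torch", "cv2"]  # the example modules from above
with ThreadPoolExecutor() as pool:
    codes = list(pool.map(prefetch, names))

# Step 3 stays sequential so module side effects keep their source order.
for name in names:
    __import__(name)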

P.S. All these are based on the assumption that no similar optimization has been implemented already. Perhaps a better solution is already integrated. I know very little about this.

Edit: and there is no evidence of how much improvement this could provide. I strongly agree that optimization should be based on actual instrumentation evidence.

Everything is currently synchronous (modulo actual threaded imports)


Not sure how this is any more parallelizable than having three statements one after another. Either way, the first import in the list could do anything, including changing sys.path. Any optimizations have to be entirely transparent (for example, having some way to recognize anything that affects imports, and flushing the cache), and with that consideration, ALL top-level imports could be optimized the same way.

They should work the same. I typed it this way because I was replying on my phone and it saved me some effort.

Like speculative execution on a modern CPU, a predictive fetch can be invalidated in the worst case; no side effects are introduced before actual execution.