Bundling Python apps for the web & top level imports

The size of Python applications is clearly an issue for running them on a web page.

There are a number of attempts to address this. Here I would like to start a discussion about bundling applications (i.e. shipping only a subset of Python files), as opposed to compressing all files or lazy loading, which should probably be a separate thread.

I have started some experimental work on a bundler for applications using Pyodide: GitHub - rth/pyodide-pack: Python package bundler for the web. There are likely various ways to improve this with dead code elimination / tree shaking and other techniques. One blocker I see, however, is the practice of using top-level imports everywhere in the Python ecosystem, particularly for binary extensions (which are quite slow to load in the browser).

There may be something to explore in rewriting code at packing time to move top-level imports inside functions, but it still feels rather brittle.
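
Even just identifying the candidates for such a rewrite could be done with the ast module; a minimal sketch of that first step (detection only, no rewriting):

```python
import ast

def top_level_imports(source):
    """List the modules imported at module level in a piece of source code."""
    tree = ast.parse(source)
    found = []
    for node in tree.body:  # only direct children, i.e. top-level statements
        if isinstance(node, ast.Import):
            found.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            found.append(node.module or "." * node.level)
    return found

sample = "import numpy\nfrom os import path\n\ndef f():\n    import json\n"
print(top_level_imports(sample))  # ['numpy', 'os'] -- the function-level json import is not flagged
```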

So what I was wondering about is whether it would make sense to have a coherent message for package maintainers about when one might want to avoid top-level imports (particularly for C extensions) and instead prefer imports inside the methods/functions where these objects are actually used. I imagine there is some intersection there with optimizing import times in general (cc @mdroettboom), as loading all those dynamic libraries even outside of the browser might not be ideal performance-wise. This could apply to the stdlib as well.

To give a concrete example, numpy in its __init__ will load most of the dynamic libraries it contains. For instance, it loads _pocketfft_internal here, which is used only once, in the _raw_fft function below. One could imagine moving that import directly into that function. Most applications using numpy do not use the FFT, but would still load it.
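
A rough sketch of what that move could look like (simplified rather than the exact current numpy source; _pocketfft_internal and _raw_fft are the names in numpy/fft/_pocketfft.py, but the signature and call here are illustrative):

```python
# numpy/fft/_pocketfft.py (sketch)

# Current pattern: the compiled extension is loaded as soon as numpy.fft
# (and hence numpy itself) is imported, whether or not an FFT is ever computed.
# from . import _pocketfft_internal as pfi

def _raw_fft(a, n, axis, is_real, is_forward, inv_norm):
    # Deferred variant: the C extension is only loaded on the first FFT call;
    # later calls pay just a cached sys.modules lookup for the import statement.
    from . import _pocketfft_internal as pfi
    return pfi.execute(a, is_real, is_forward, inv_norm)  # illustrative call
```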

There are also downsides to moving imports inside functions, though:

  • it becomes more difficult to identify build issues. One needs to run a test suite to do so (or know which submodules to import), as opposed to just importing the top-level module as one can currently
  • there is a performance cost when calling the function in question, particularly for the first call (a small timing sketch follows this list).
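
To make that second point concrete, here is a small self-contained timing sketch (it assumes numpy is installed; the exact numbers will vary):

```python
import time

def spectrum(x):
    import numpy.fft  # function-level import: resolved on every call
    return numpy.fft.fft(x)

# The first call pays the full cost of importing numpy.fft (and its compiled
# extension); later calls only pay a cached lookup in sys.modules.
t0 = time.perf_counter()
spectrum([1.0, 2.0, 3.0])
first = time.perf_counter() - t0

t0 = time.perf_counter()
spectrum([1.0, 2.0, 3.0])
second = time.perf_counter() - t0

print(f"first call: {first:.4f}s, second call: {second:.6f}s")
```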

I would be interested in some feedback about this issue. I imagine it’s a topic that has been discussed at length before in other contexts.

There was also a somewhat related proposal for scientific packages, Scientific Python - SPEC 1 — Lazy Loading for Submodules. Though I think for the WASM use case, if accepted, it might make the situation worse, as it then becomes even less predictable how many modules need to be loaded asynchronously over the network and where in the code that happens. Though maybe I’m misunderstanding how that SPEC would work.
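
For reference, the kind of laziness SPEC 1 describes can be approximated with a module-level __getattr__ (PEP 562); a minimal sketch for a hypothetical package:

```python
# mypackage/__init__.py -- hypothetical package with lazily loaded submodules
import importlib

_lazy_submodules = {"fft", "linalg"}

def __getattr__(name):
    # `import mypackage` no longer loads these; each one is imported the first
    # time someone accesses mypackage.fft or mypackage.linalg.
    if name in _lazy_submodules:
        return importlib.import_module(f".{name}", __name__)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```

For a bundler (or a browser runtime) this shifts the moment of loading from import time to first attribute access, which is exactly the unpredictability I mean.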

cc @tiran @fpliger

3 Likes

There is also the very recently submitted PEP 690 for lazy imports. It’s definitely more implicit than SPEC 1 – it just assumes that since lazy loading would be an optional flag (at least to start), there’s not as much need to really customize the behavior, since you can always fall back to eager loading (with a performance penalty).

I think either of these proposals will help reduce the amount of code that is loaded. But as you point out, the interesting thing in the browser/wasm environment is how more, smaller HTTP requests could have an adverse effect on things. I wonder if more “keep alive” sorts of approaches like websockets or some of the things in HTTP/3 would help.

Thanks for mentioning PEP 690, quite interesting as well.

But as you point out, the interesting thing in the browser/wasm environment is how more, smaller HTTP requests could have an adverse effect on things.

Yes, there are certainly different directions to explore and partial lazy loading over the network could be one of them.

Here I’m exploring the idea of a bundler that looks at the files used at runtime on some representative test cases and creates an archive with the necessary files, with no network downloads required later. In that context, if PEP 690 were implemented and enabled, on one hand it would indeed solve this issue. On the other, it would require the test cases used by the bundler to have nearly 100% code coverage, which is difficult to achieve. Currently, a few representative code examples are sufficient to include the necessary files (because of top-level imports). With something like PEP 690, even a different if branch can perform an attribute access that requires a previously unimported Python module (and fail if it’s missing).
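
To illustrate the concern with a small self-contained example (json and sqlite3 just stand in for a pure-Python and a C-extension dependency; the lazy semantics are as PEP 690 describes them, not something that can be enabled today):

```python
import json
import sqlite3  # under PEP 690-style lazy imports, nothing is loaded here yet


def load(path, use_db=False):
    if use_db:
        # With lazy imports, sqlite3 is only resolved the first time this
        # branch runs. A bundler whose representative test cases never pass
        # use_db=True would not see sqlite3's files, would omit them from the
        # archive, and this branch would then fail in the bundled app.
        with sqlite3.connect(path) as conn:
            return conn.execute("SELECT payload FROM items").fetchall()
    with open(path) as f:
        return json.load(f)
```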

Such bundlers have been attempted in other contexts (mostly building dependency-free distributions for Python apps) and AFAIK there have always been problems that made things less than perfect – mostly modules that are loaded using some dynamic scheme that isn’t easily discovered by a generic import-follower (the stdlib has one of those: modulefinder.py).

PyOxy was just announced. I haven’t looked at it in detail yet.

The problem is that this then starts to run into readability and historical-practice concerns. For instance, if I have to start writing import statements inside every function I write, I’m going to find it annoying, both from a repetition perspective and because I won’t have a clear idea of what the module depends on without grepping through the whole file.

And we aren’t even talking about data files yet.

It’s the same sort of issue JS bundlers run into by lacking complete information. The other approach I have seen is post-test analyzing what sys.modules contains.
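
A minimal sketch of that approach (app_entry.py is just a placeholder for a representative entry point):

```python
import runpy
import sys

baseline = set(sys.modules)

# Run a representative workload, then see which modules it pulled in.
runpy.run_path("app_entry.py", run_name="__main__")  # placeholder script

used = {
    name: getattr(module, "__file__", None)
    for name, module in sys.modules.items()
    if name not in baseline
}
for name, path in sorted(used.items()):
    print(name, path)
```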

That’s basically a single-file Python interpreter with a YAML API to do extra stuff at start-up. I don’t think it directly applies here, although it does use oxidized_importer Python Extension — PyOxidizer 0.24.0 documentation, which gets into custom importers.

1 Like

FWIW: there’s an analogous concern here with packaging bundled apps (e.g., iOS/Android/desktop apps). In that context, there’s a useful distinction between two use cases:

  1. Apps that happen to be written in Python, but don’t do anything “dynamic” at runtime.
  2. Dynamic apps that could import anything at any time.

In Case 1, the list of dependencies may not be easy to determine, but it is ultimately knowable, and can be encoded as a list of distribution requirements. Worst case, an explicit requirements list can be defined as part of the configuration. While 100% automated dependency determination would be a nice feature, even an 80% automated solution would be workable if there are manual overrides for the last 20%. The bigger problem to be solved is being able to pare the standard library (and other dependencies) down to the minimal set that is needed at runtime.

Case 2 is the “full Python notebook” case, or anything else where the set of runtime requirements isn’t known at the time of distribution. The problem to be solved for this case is how to ship a minimal set of requirements, but augment those requirements over time as they are required.

I’d advocate for treating the two problems as separate. It would be entirely possible to improve the situation for Case 1 in a way that does nothing to advance Case 2 (and vice versa). The vast majority of mobile apps (and, I suspect, web apps) would benefit from a solution for Case 1. A solution for Case 2 would likely violate App Store distribution rules; in the web context, I imagine the analog is CORS and/or security concerns.

5 Likes

Thanks, everyone, for your feedback!

Yes, this is entirely about Case 1.

The other approach I have seen is post-test analyzing what sys.modules contains.

So far I’m experimenting with post-test analysis of filesystem syscalls, which is more general (it also works for data files, not just .py) and relatively easy to do with Emscripten, particularly given that we fully control the environment and have a single process running, unlike on classical OSs. Conceptually it’s similar to the problem of having a bundler that only works in a given Docker image. Though the issues with dynamic behavior you mentioned are likely still a concern.
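
Outside of Emscripten, a rough analog of that idea can be sketched with an audit hook (this only sees files opened through the normal open/open_code paths, so for example extension modules dlopen'ed by the import system are missed; the Emscripten filesystem layer gives a more complete picture):

```python
import sys

accessed = set()

def _track(event, args):
    # The "open" audit event fires for io.open() and for the interpreter's
    # open_code(), which covers .py/.pyc sources and most data files.
    if event == "open" and isinstance(args[0], str):
        accessed.add(args[0])

sys.addaudithook(_track)

# Representative workload goes here; afterwards `accessed` lists the files it
# touched, which the bundler can use to decide what to keep in the archive.
import csv  # example workload: importing csv opens csv.py (or its .pyc)
print(sorted(accessed))
```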

The problem is that this then starts to run into readability and historical-practice concerns.

OK, the answer of “let’s not touch top-level imports”, and that bundlers should work within this constraint, also works for me. Thanks!

Hello! I am curious if this discussion has moved elsewhere?

Generally, I feel the need to ask the obvious question:

Why can’t I take the list of files generated by modulefinder and/or sys.modules and then walk the packaged Python environment and remove all files not in those lists? Assume that the application makes all of its imports known at the entry point.
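
Roughly along these lines (a sketch; app_entry.py and the environment path are placeholders):

```python
import pathlib
import sys
from modulefinder import ModuleFinder

finder = ModuleFinder()
finder.run_script("app_entry.py")  # placeholder entry point

# Files that the static analysis says are needed.
keep = {pathlib.Path(m.__file__).resolve()
        for m in finder.modules.values() if m.__file__}

# Walk the packaged environment and report what could be dropped.
env_root = pathlib.Path(sys.prefix) / "lib"  # placeholder for the packaged env
for path in env_root.rglob("*.py"):
    if path.resolve() not in keep:
        print("could remove:", path)
```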

Cheers,
JP

1 Like

Not that I’m aware of.

That assumption doesn’t universally hold, though. People have in the past failed to exercise some code path, causing an import to be missed and leading to a broken app.