The size of Python applications is clearly an issue for running them on a web page.
There are a number of attempts to address this. Here I would like to start a discussion about bundling applications (i.e. shipping only the subset of Python files they actually need), as opposed to compressing all files or lazy loading, which should probably be discussed in separate threads.
I have started some experimental work on a bundler for applications using Pyodide in GitHub - rth/pyodide-pack: Python package bundler for the web. There are likely various ways to improve this with dead code elimination / tree shaking and other techniques. One blocker I see, however, is the practice of using top-level imports everywhere in the Python ecosystem, particularly for binary extensions (which are quite slow to load in the browser).
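For reference, one simple way to figure out which modules an application actually loads (this is just a sketch of the general idea, not necessarily what pyodide-pack does, and `app.py` is a placeholder path) is to run the entry point once and diff `sys.modules`:

```python
# Sketch: detect which modules an application actually uses by running it
# once and recording what ended up in sys.modules afterwards.
import runpy
import sys

def trace_used_modules(entry_point):
    before = set(sys.modules)
    runpy.run_path(entry_point, run_name="__main__")  # execute the app once
    return set(sys.modules) - before

if __name__ == "__main__":
    used = trace_used_modules("app.py")  # "app.py" is a placeholder
    print(sorted(used))
```

Of course this only sees the code paths that were exercised in that one run, which is part of why top-level imports of rarely used binary extensions are so costly for bundle size.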
There is maybe something that could be explored about rewriting code at packing time to move top-level imports inside functions, but that still feels rather brittle.
So what I was wondering about is whether it would make sense to have a coherent message for package maintainers about when one might want to avoid top-level imports (particularly for C extensions) and instead prefer imports inside the methods/functions where these objects are actually used. I imagine there is some intersection with optimizing import times in general (cc @mdroettboom ), as loading all those dynamic libraries might not be ideal performance-wise even outside of the browser. This could apply to the stdlib as well.
To give a concrete example, numpy in its init will load most of the dynamic libraries it contains. For instance, it would load `_pocketfft_internal` here, which is used only in `_raw_fft` below. One could imagine moving that import directly into the function that uses it. Most applications using numpy do not use FFT, but would still load it.
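To make this concrete, here is a simplified sketch of what such a move could look like. This is not numpy's actual source, and the `execute` call is illustrative only:

```python
# Simplified sketch of a package module (not numpy's actual code).

# Top-level import: the binary extension is loaded as soon as the package
# is imported, even by applications that never compute an FFT.
# from . import _pocketfft_internal as pfi

def _raw_fft(a, is_real, is_forward, norm):
    # Function-level import: the extension is only loaded on the first call;
    # later calls find it already cached in sys.modules.
    from . import _pocketfft_internal as pfi
    return pfi.execute(a, is_real, is_forward, norm)  # illustrative call
```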
There are also downsides to moving imports inside functions though:
- it becomes more difficult to identify build issues. One needs to run a test suite (or know which submodules to import) to detect them, as opposed to just importing the top-level module as is currently the case
- there is a performance cost when calling the function in question, particularly for the first call (as sketched below).
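On the second point, the cost is mostly a one-time hit: after the first call the module is cached in `sys.modules` and the import statement is essentially a dict lookup. A rough way to see this, using `json` as a stand-in for a heavier binary extension:

```python
# Sketch: the first call pays the import cost, later calls hit the
# sys.modules cache.
import sys
import time

def load_data(text):
    import json  # function-level import: loaded on first call, cached after
    return json.loads(text)

sys.modules.pop("json", None)  # ensure the first call really imports it
t0 = time.perf_counter(); load_data("{}"); t1 = time.perf_counter()
t2 = time.perf_counter(); load_data("{}"); t3 = time.perf_counter()
print(f"first call:  {t1 - t0:.6f}s (includes import)")
print(f"second call: {t3 - t2:.6f}s (module already cached)")
```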
I would be interested in some feedback about this issue. I imagine this is a topic that has been discussed at length before in other contexts.
There was also a somewhat related proposal, Scientific Python - SPEC 1 — Lazy Loading for Submodules, for scientific packages. I think that for the WASM use case, if accepted, it might actually make the situation worse, as it then becomes even less predictable how many modules need to be loaded asynchronously over the network and where in the code that happens. Though maybe I’m misunderstanding how that SPEC would work.
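For context, the mechanism that kind of lazy loading builds on is a module-level `__getattr__` (PEP 562). A minimal sketch with hypothetical names, just to illustrate why the actual load point moves to wherever user code first touches the attribute:

```python
# mypkg/__init__.py -- minimal PEP 562-style lazy loading sketch
# (package and submodule names are hypothetical).
import importlib

_submodules = {"fft", "linalg"}

def __getattr__(name):
    # The submodule is only imported when the attribute is first accessed,
    # so in the browser the corresponding fetch happens at whatever point
    # user code first touches mypkg.fft -- hard to predict ahead of time.
    if name in _submodules:
        return importlib.import_module(f".{name}", __name__)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```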