Minifying the stdlib (in Pyodide)

rth · April 26, 2021, 9:54pm

If one wanted to reduce the on-disk size of the stdlib, what do you think would be the best approach?

The use case is Pyodide where a REPL with CPython interpreter + stdlib currently takes ~6.4MB to download (and 4 to 5 s to load). A fair amount of that is due to the size of pure Python files in the stdlib. It’s gzip compressed, but there is very likely still some overhead from extracting individual .py files (and importing them without .pyc), so reducing the size would help.

There are potentially several approaches:

Use a minifier such as python-minifier. There the question is how much minification is too much. For instance, I imagine it might be better to preserve local variable names for tracebacks.
Only ship .pyc files. This does reduce size and possibly would help performance. This post from 10 years ago suggests that it would likely be very brittle, though I’m not sure how up to date that analysis is. Also, does anyone know the performance impact of not writing .pyc files?
Remove some of the infrequently used modules (and package them as standalone packages). This is related to a long thread about the stdlib here. Then one can’t really say that the stdlib is included, though. We are already removing some stdlib modules that don’t make sense in the browser, but the cost/benefit of removing more is not clear.

Another constraint is that a fair amount of use is interactive, so keeping docstrings, for instance, is still very useful.
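To make the "how much minification is too much" question concrete, here is a minimal stdlib-only sketch of the most conservative end of the scale. It is not what python-minifier does: it only round-trips the source through the AST, which drops comments and blank lines while keeping docstrings and every name. It assumes the build step runs on Python 3.9 or newer (for `ast.unparse`); `strip_comments` and `SAMPLE` are made-up names for illustration.

```python
import ast


def strip_comments(source: str) -> str:
    """Round-trip through the AST: comments go, docstrings and names stay."""
    return ast.unparse(ast.parse(source))


SAMPLE = '''\
# A long explanatory comment that is not needed at runtime.
def area(width, height):
    """Return the area of a rectangle."""
    # Multiply the two sides.
    result = width * height
    return result
'''

minified = strip_comments(SAMPLE)
print(len(SAMPLE), "->", len(minified), "characters")
print(minified)
```

Because local names and docstrings survive, tracebacks and help() keep working; the saving is correspondingly smaller than with a real minifier.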

Are there other things that could be attempted (when building CPython from sources)? Any feedback would be much appreciated.

cc @brettcannon

encukou · April 26, 2021, 10:48pm

Also see here: Disk space minimization for Python distributors

Here’s an easy thing that might save you ~0.3 MB. We haven’t had trouble with it in Fedora:

Large auto-generated files (pydoc_data/topics.py and several encoding modules) are shipped as .pyc only, without source, since the source is not very informative.

Your remove_modules.txt doesn’t list test and other Lib/*/tests. That’s implied, right?

Is there?
importlib should never look at the contents of .py files if an up-to-date .pyc is available.

steven.daprano · April 26, 2021, 11:33pm

How do you know that the poor startup performance is due to the std lib?

According to the documentation:

“Pyodide brings the Python 3.8 runtime to the browser via WebAssembly,
along with the Python scientific stack including NumPy, Pandas,
Matplotlib, SciPy, and scikit-learn.”

I tried the REPL, and as you said it took about 5 seconds to load, which
I felt was quite acceptable on my PC and internet connection. I then
tried import numpy and import scipy and they loaded instantly, with
no visible delay.

When I try the same on my local system, loading the REPL is
instantaneous, after which the imports involve a visible pause (half a
second?) each.

So my guess is that you have preloaded the entire scientific stack,
numpy, scipy, pandas, etc, which is why importing them is
instantaneous. Am I right?

I don’t know how WebAssembly works, but my guess is that the entire
Python runtime, std lib and third party libraries and all, are loaded
into the browser. Certainly there was no pause long enough to suggest
that the Pyodide REPL was loading numpy and scipy over the network when
I ran the imports, not unless my internet connection has suddenly become
faster than my local SSD. (Fat chance.)

I don’t think you should worry about minifying the std lib until you
have looked at the impact of preloading the huge set of third-party
libraries that you ship.

Blackward · April 27, 2021, 1:11am

Hi Roman,

First, let me say that I was quite happy to learn recently that Pyodide is alive again. That is great news! Great project! Keep it up, guys…

I learned, you should not do that. Just delivering the .pyc files might reduce the size significantly, but it could lead to a vast number of small and hard-to-handle incompatibilities, right? I personally would prefer waiting a few seconds at startup over all that hassle with potential incompatibilities…
I would not do that either. Stay compatible with the standard CPython distribution here! But what could be done is loading modules just ON DEMAND. That’s the normal way to avoid huge startup times, if you ask me…

What I mean is: your idea to split the standard lib into smaller parts is not bad, but the user should not notice it. The smaller parts would be loaded behind the scenes as soon as they are needed (and not at startup), or something along those lines.

For example, you could load the popular parts of the stdlib at startup, and the parts you proposed to remove from the stdlib could be loaded only when they are needed: a fast AND compatible solution.

Cheers, Dominik

rth · April 27, 2021, 10:04am

Thanks for the feedback!

Your remove_modules.txt doesn’t list test and other Lib/*/tests. That’s implied, right?

@encukou Yeah, those are excluded separately so that we can package test as a standalone package and run the test suite.

importlib should never look at the contents of .py files if an up-to-date .pyc is available.

Right, but we are not shipping any .pyc currently and have disabled their creation at runtime.
Thanks for linking the other discussion and suggestions!
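For readers who want to check the same two things on their own interpreter, a small sketch: `importlib.util.cache_from_source` reports where importlib would look for the cached bytecode of a given source file, and `sys.dont_write_bytecode` reports whether writing .pyc files is disabled. The source path below is hypothetical and does not need to exist.

```python
import importlib.util
import os
import sys

# Hypothetical location of a stdlib source file inside the distribution.
source_path = os.path.join("lib", "python3.8", "json", "decoder.py")

# Where importlib would look for (and normally write) the cached bytecode.
cache_path = importlib.util.cache_from_source(source_path)
print(cache_path)

# True when .pyc creation is disabled, e.g. via `python -B` or the
# PYTHONDONTWRITEBYTECODE environment variable.
print("bytecode writing disabled:", sys.dont_write_bytecode)
```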

So my guess is that you have preloaded the entire scientific stack,
numpy, scipy, pandas, etc, which is why importing them is
instantaneous. Am I right?

@steven.daprano No, and I’m not sure why you experienced that. The base setup definitely doesn’t bundle any of those packages; however, they will be loaded when you import them. You can open the browser developer tools with F12 (then go to the Network tab) to see the corresponding network requests.

Stay compatible with the standard CPython distribution here! But what could be done is loading modules just ON DEMAND

@Blackward So yes, that is indeed a possibility. The issue is how to detect that a stdlib module is being loaded. In user code we currently parse the imports and load the packages if available; however, if the import comes from some other stdlib module (or an external package), that would likely require writing some custom import hooks (and would also make the execution time a bit less predictable).
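A minimal sketch of what such an import hook could look like, using only the stdlib. This is an illustration and not Pyodide's implementation: `OnDemandFinder`, `fetch_rarely_used`, and the `rarely_used` module are hypothetical, and the "download" is simulated by writing a file to a temporary directory. The finder sits at the front of `sys.meta_path`, runs the fetch step the first time a deferred module is requested (no matter which module triggered the import), and then lets the regular finders do the actual import.

```python
import importlib
import importlib.abc
import pathlib
import sys
import tempfile


class OnDemandFinder(importlib.abc.MetaPathFinder):
    """Run a fetch callback the first time a deferred module is imported."""

    def __init__(self, fetchers):
        # Maps a top-level module name to a callable that makes it importable.
        self._fetchers = dict(fetchers)

    def find_spec(self, fullname, path=None, target=None):
        fetch = self._fetchers.pop(fullname, None)
        if fetch is not None:
            fetch()  # e.g. download and unpack the module (hypothetical)
            importlib.invalidate_caches()
        # Returning None hands the import over to the regular finders.
        return None


with tempfile.TemporaryDirectory() as site_dir:
    fetched = []

    def fetch_rarely_used():
        # Stand-in for a network download: write the module to disk.
        pathlib.Path(site_dir, "rarely_used.py").write_text("VALUE = 123\n")
        fetched.append("rarely_used")

    finder = OnDemandFinder({"rarely_used": fetch_rarely_used})
    sys.meta_path.insert(0, finder)
    sys.path.insert(0, site_dir)
    try:
        import rarely_used  # triggers the fetch, then a normal import
        value = rarely_used.VALUE
    finally:
        sys.meta_path.remove(finder)
        sys.path.remove(site_dir)
        sys.modules.pop("rarely_used", None)

print(fetched, value)
```

The fetch here is synchronous, which is the hard part in a browser: a real hook would need a blocking download or preloaded data, and that is where the less predictable execution time comes from.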

J-M0 · April 28, 2021, 8:32pm

I wanted to point out a couple of things about that article:

When it says “Python distribution,” it’s talking about distributing a package, like requests, not a whole Python interpreter + stdlib.
The part that is brittle is that Python bytecode can change between Python versions, so compiled .pyc files are not portable between Python 3.8 and 3.9 or potentially even 3.8.1 and 3.8.2.

The post also says a good use case for a .pyc only distribution is a constrained system that you control. I’d say a Python repl running in a browser qualifies as that.

If you control the version of the Python interpreter used, I don’t see why you couldn’t ship a .pyc only stdlib if it saved you space.
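A small sketch of both points, assuming nothing beyond the stdlib: the .pyc starts with a magic number tied to the interpreter's bytecode format, and a module can be imported from a .pyc alone as long as that file sits where the .py would have been (not inside `__pycache__`). The module name `demo_sourceless` is made up for the demonstration.

```python
import importlib
import importlib.util
import pathlib
import py_compile
import sys
import tempfile

with tempfile.TemporaryDirectory() as lib_dir:
    source = pathlib.Path(lib_dir, "demo_sourceless.py")
    source.write_text("def answer():\n    return 42\n")

    # A sourceless import needs the .pyc next to where the .py would be,
    # not inside __pycache__, so the output path is given explicitly.
    pyc = source.with_suffix(".pyc")
    py_compile.compile(str(source), cfile=str(pyc), doraise=True)
    source.unlink()

    # The first four bytes identify the bytecode format of the compiling
    # interpreter; an interpreter with a different magic number rejects it.
    magic_matches = pyc.read_bytes()[:4] == importlib.util.MAGIC_NUMBER

    sys.path.insert(0, lib_dir)
    try:
        importlib.invalidate_caches()
        demo = importlib.import_module("demo_sourceless")
        result = demo.answer()
        loaded_from = pathlib.Path(demo.__file__).name
    finally:
        sys.path.remove(lib_dir)
        sys.modules.pop("demo_sourceless", None)

print(magic_matches, result, loaded_from)
```

For a whole tree, `python -m compileall -b` writes the .pyc files to these legacy locations in one pass; tracebacks from sourceless modules will lack source lines.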

Blackward · April 28, 2021, 8:44pm

Hi Roman,

one might be able to hook into the import process without hooking in - if you know what I mean:)

Let me explain this using an example. Please note that it is only meant as an example of the general idea; it might not be THE solution. So do not take it literally…

main.py:

import exampleModule

print(exampleModule.exampleVariable)
exampleModule.ExampleClass()

exampleModule.py:

# This is just an empty dummy module
# which loads the original module on demand.

# Downloading the original module could be done
# before the following line.
from originalExampleModul import *

originalExampleModul.py:

exampleVariable = 123

class ExampleClass:
    def __init__(self):
        print(456)

Maybe something like that?
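A runnable variation on the stub idea, offered as a sketch and not as the proposal itself: with the star-import, the original module is loaded as soon as the stub is imported, whereas a module-level `__getattr__` (PEP 562, Python 3.7+) defers loading until the first attribute access. The file names follow the example above; the download step is only marked by a comment, and the two files are written to a temporary directory for the demonstration.

```python
import importlib
import pathlib
import sys
import tempfile

STUB_SOURCE = '''\
import importlib

_real = None

def __getattr__(name):
    # Called only for attributes the stub itself does not define (PEP 562).
    if name.startswith("__"):
        raise AttributeError(name)
    global _real
    if _real is None:
        # A download step could run here before the import (hypothetical).
        _real = importlib.import_module("originalExampleModul")
    return getattr(_real, name)
'''

ORIGINAL_SOURCE = "exampleVariable = 123\n"

with tempfile.TemporaryDirectory() as lib_dir:
    pathlib.Path(lib_dir, "exampleModule.py").write_text(STUB_SOURCE)
    pathlib.Path(lib_dir, "originalExampleModul.py").write_text(ORIGINAL_SOURCE)
    sys.path.insert(0, lib_dir)
    try:
        importlib.invalidate_caches()
        stub = importlib.import_module("exampleModule")
        loaded_before = "originalExampleModul" in sys.modules
        value = stub.exampleVariable  # first access loads the original
        loaded_after = "originalExampleModul" in sys.modules
    finally:
        sys.path.remove(lib_dir)
        for name in ("exampleModule", "originalExampleModul"):
            sys.modules.pop(name, None)

print(loaded_before, value, loaded_after)
```

One limitation of this variation: `from exampleModule import *` on the stub would not see the original's names, since the stub defines none of its own.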

Cheers, Dominik

Blackward · April 28, 2021, 8:55pm

Hi James,

There indeed is some truth in that.

I am not familiar enough with how Pyodide works in detail yet - so, may I ask, is the version of the Python interpreter under the SOLE control of those developing resp. packaging Pyodide?

Cheers, Dominik

rth · May 3, 2021, 6:45pm

Thanks for the clarifications and ideas!

Interesting idea. If one has to do that for a full tree hierarchy in many modules, I imagine it could get somewhat complex. It might be easier to go with import hooks, as they are less invasive.

I am not familiar enough with how Pyodide works in detail yet - so, may I ask, is the version of the Python interpreter under the SOLE control of those developing resp. packaging Pyodide?

Well, anyone can build CPython packaged for Pyodide, and I’m not fully sure what the binary compatibility between build A and a build B that uses a different minor CPython version would be, but more compatibility between versions wouldn’t hurt.

In the end, for now we went with making some of the larger stdlib packages (distutils, email, and possibly a couple of others) optional but loaded by default (and users can opt in to load them as standalone packages if size is an issue).

Thanks for all the feedback!