[WASM] Unvendoring some of stdlib modules

ryanking13 · August 9, 2022, 4:56am

The size of a Python application is critical in browser environments.

In Pyodide, we use an “unvendoring” approach: we remove some of the infrequently used stdlib modules from the core Python distribution, then package them as standalone packages.
This lets the core Python runtime download quickly, while unvendored stdlib modules are downloaded and installed only when needed.

For example, the “test” module is larger than 20MB, so Pyodide unvendors it by default. We are thinking of removing more optional modules to reduce the size, such as modules that are planned to be deprecated (PEP 594), or some large modules like sqlite3 and the CJK codecs.

The problem is that this approach might break people’s assumptions about what the stdlib should contain.

So what I want to discuss is,

Is this approach scalable?
How can we decide whether a stdlib module is “optional”?

Previous discussion on this topic: Minifying the stdlib (in Pyodide)

cc @rth, @hoodmane, @tiran

EpicWink · August 9, 2022, 5:35am

A quick idea you’ve likely already considered: you could add an import hook which checks for known removed standard-library modules on import failure, then installs those packages from a repository and retries the import. Depending on the application (e.g. a first-time Python user), this could be desirable.

encukou · August 9, 2022, 8:09am

I wouldn’t call it (un)vendoring if it’s made by the same team as the rest.

In Fedora we package these separately:

test – it’s big and not generally useful
tkinter (and idle, turtle, turtledemo) – these depend on graphics, which aren’t wanted on servers

This works because it’s been done that way for a long time. The issue with splitting more of the stdlib out is that there’s no way to declare dependencies on individual stdlib modules. Users will see things fail “randomly” when they install or update some indirect dependency. So I don’t think it’s scalable now.

I think it would be enough to make imports fail with “module sqlite3 is not included in Pyodide base, install it separately by doing X, see Y for more details” to make users happy, as long as it’s rare to get too many of those messages in a row. The deferred PEP 534 proposes doing that. I still think it’s a good idea, it just hasn’t made it to the top of anyone’s TODO list yet. The main blocker is now solved, even.
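A minimal sketch of what such an import failure could look like, assuming a finder appended to the end of sys.meta_path so that it is only consulted after the regular finders have found nothing. The UNVENDORED table, the module and package names in it, and the message wording are all hypothetical; this is not how Pyodide or PEP 534 implements it.

```python
import importlib
import importlib.abc
import sys

# Hypothetical table: modules split out of the base distribution, mapped to
# the (equally hypothetical) package that provides them.
UNVENDORED = {"hypothetical_unvendored_mod": "hypothetical-unvendored-mod"}


class UnvendoredStdlibFinder(importlib.abc.MetaPathFinder):
    """Last-resort finder that replaces the generic error with a hint."""

    def find_spec(self, fullname, path=None, target=None):
        top_level = fullname.partition(".")[0]
        package = UNVENDORED.get(top_level)
        if package is None:
            # Not one of ours: let the import machinery raise as usual.
            return None
        raise ModuleNotFoundError(
            f"module {top_level!r} is not included in the base distribution; "
            f"install the {package!r} package separately",
            name=fullname,
        )


# Appended, not inserted at the front, so real modules are still found first.
sys.meta_path.append(UnvendoredStdlibFinder())

try:
    importlib.import_module("hypothetical_unvendored_mod")
except ModuleNotFoundError as exc:
    message = str(exc)
    print(message)
```

Because the hook only raises, it stays synchronous and does not need to install anything during the import.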

In Fedora we also ship .pyc only (no .py source) for large autogenerated files: pydoc_data/topics.py and encodings generated by gencodec.py.

rth · August 9, 2022, 9:00am

you could add an import hook which checks for known removed standard-library modules on import failure

We can’t currently do that easily because import is sync and package installation is async, while WASM doesn’t yet support stack switching (pyodide#2664), so we cannot make that async call sync via greenlet or something similar. It might be possible to do via a separate webworker thread, but that adds complexity and overhead.

The issue with splitting more of the stdlib out is that there’s no way to declare dependencies on individual stdlib modules […] So I don’t think it’s scalable now.

Well, in the packages that are part of the Pyodide distribution we do add these modules as extra dependencies. So in that case dependency resolution works, but it’s more work, and a) it’s indeed still an issue with packages installed from PyPI, and b) as more and more packages are unvendored, associated issues will become more frequent. I’m also concerned about the scalability of this approach.
So it’s a bit of a compromise, currently we mostly unvendor stdlib modules that are going to be deprecated in the future (e.g. distutils).

I think it would be enough to make imports fail with “module sqlite3 is not included in PyIodide base, install it separately by doing X, see Y for more details”

Interesting idea, thanks! Maybe that would indeed be better than regular ImportErrors.

erlendaasland · August 9, 2022, 9:02am

How large is the sqlite3 module?

rth · August 9, 2022, 9:06am

Good to know. @tiran was also saying they are doing this in WASM builds. Would someone mind summarizing what are the potential limitations of this (also for packages), if any? As far as I understand .pyc are only compatible with a given Python minor version 3.X, so we cannot do that for universal wheels but we could do it for stdlib & binary wheels? Is compatibility for patch versions (3.11.X) guaranteed? Also in case of tracebacks does one get the original line number if the original .py files are not present?

cc @antocuni

encukou · August 9, 2022, 9:29am

Would someone mind summarizing what are the potential limitations of this (also for packages), if any?

There’s no source, so you won’t see source lines in tracebacks, and I assume the new range underlines also aren’t there. This makes debugging quite a bit harder (especially if you use a slightly different version of the source by mistake). That’s why we do it only where the source is very boring.
And it can’t be used with different minor versions of CPython, or with different implementations.

As far as I understand .pyc are only compatible with a given Python minor version 3.X, so we cannot do that for universal wheels but we could do it for stdlib & binary wheels?

I’ve never heard of stdlib wheels.
You can do it for anything that is only installed for a specific CPython minor version.

Is compatibility for patch versions (3.11.X) guaranteed?

Yes (except Alphas/Betas of 3.Y.0).
They’re also architecture-independent, though that probably doesn’t matter for Wasm.

Also in case of tracebacks does one get the original line number if the original .py files are not present?

Yes, you do get the line number.

You might be also interested in an alternate install scheme – use/ship .pyc primarily, but make source available – that I proposed and plan to get back to eventually. For Pyodide this could be helpful if fetching source (e.g. displaying tracebacks) could be made async.

tiran · August 9, 2022, 9:30am

Tracebacks for pyc-only contain correct line numbers. They lack the Python code line.

with Python files

>>> sysconfig.get_paths('invalid')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/heimes/dev/python/3.11/Lib/sysconfig.py", line 616, in get_paths
    return _expand_vars(scheme, vars)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/heimes/dev/python/3.11/Lib/sysconfig.py", line 272, in _expand_vars
    for key, value in _INSTALL_SCHEMES[scheme].items():
                      ~~~~~~~~~~~~~~~~^^^^^^^^
KeyError: 'invalid'

pyc only

>>> sysconfig.get_paths('invalid')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/runner/work/cpython-wasm-test/cpython-wasm-test/cpython/Lib/sysconfig.py", line 616, in get_paths
  File "/home/runner/work/cpython-wasm-test/cpython-wasm-test/cpython/Lib/sysconfig.py", line 272, in _expand_vars
KeyError: 'invalid'

In this example the file name is also incorrect. That should be easy to fix.
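The pyc-only behaviour can be reproduced on a regular CPython with a small sketch like the one below. It assumes a throwaway module named pyc_only_demo (a made-up name) compiled next to its source, after which the source is deleted so that only the .pyc is importable.

```python
import importlib
import os
import py_compile
import sys
import tempfile
import traceback

SOURCE = "def boom():\n    raise RuntimeError('pyc only')\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "pyc_only_demo.py")
    with open(src, "w") as f:
        f.write(SOURCE)

    # Write the .pyc next to the source (not into __pycache__) so it can be
    # imported without the .py file, then remove the source.
    py_compile.compile(src, cfile=os.path.join(tmp, "pyc_only_demo.pyc"))
    os.remove(src)

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        module = importlib.import_module("pyc_only_demo")
        try:
            module.boom()
        except RuntimeError:
            formatted = traceback.format_exc()
    finally:
        sys.path.remove(tmp)

# The frame for boom() reports its line number, but no source text follows.
print(formatted)
```

The file name in the traceback is the path recorded at compile time, which is why a build machine path can leak into it.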

rth · August 9, 2022, 9:50am

Tracebacks for pyc-only contain correct line numbers

Thanks for the explanations! So it’s not so bad then, users can still search for the exception message. Indeed, providing source files separately would be ideal, but it also sounds a bit more complex to set up. The reason we didn’t ship .pyc initially is that we thought the gain in download size would be offset by the gzip compression of downloaded files (and the file system is also lz4 compressed). But that still leaves the extra compilation time indeed.

That was an unfortunate wording, I meant stdlib and package wheels separately.

PyIodide

BTW it’s Pyodide without the I. The name is confusing :)

rth · August 9, 2022, 10:03am

According to @ryanking13’s work in pyodide#2946, unvendoring sqlite3 reduces the WASM file size by 1.4 MB uncompressed (or ~500kB gzip compressed), so it is quite significant.

Encodings is another large module (~1.7MB uncompressed) where likely only a small part is used in practice by most users. pyodide#1542

So it’s a bit of a dilemma because it’s rather unfortunate to have to download large extra (very likely unused) modules on web pages. But at the same time unvendoring them can break some assumptions about what the stdlib should be.

tiran · August 9, 2022, 10:04am

Byte code compilation impacts startup performance a lot. The lack of a pyc cache on Pyodide is one of the problems I wanted to discuss with you and Antonio. One idea I had was wheels with pre-compiled pyc files. You already have a CDN for your binary WASM builds. Maybe you could also provide “binary” wheels for pure Python packages.

@pradyunsg What would be the correct name for a pure Python wheel that contains CPython-specific pyc files instead of py files? spam-1.0.0-cp311-none-any.whl, spam-1.0.0-cp311-cp311-any.whl, or perhaps spam-1.0.0-cp311-cp311_pyc-any.whl?

rth · August 9, 2022, 10:07am

Maybe let’s open a separate thread about that? Here or in Pyodide GH as you prefer.

brettcannon · August 9, 2022, 11:39pm

There really isn’t a good way except making a decision as to what you think your audience can live without.

Mostly. Technically it’s tied to the bytecode “magic” number embedded in the header of a .pyc file and while we try not to change it in a bugfix, it can happen and that means rewriting the file (CPython only ever supports a single bytecode magic number). But usually this only comes up during alphas and betas.
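As a rough illustration, the magic number of a .pyc can be compared with the running interpreter’s via importlib.util.MAGIC_NUMBER. The file names and the deliberately bogus “stale” header below are made up for the demonstration.

```python
import importlib.util
import pathlib
import py_compile
import tempfile


def pyc_matches_interpreter(pyc_path):
    """Return True if the .pyc header carries this interpreter's magic number."""
    with open(pyc_path, "rb") as f:
        return f.read(4) == importlib.util.MAGIC_NUMBER


with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    src = tmp / "spam.py"
    src.write_text("x = 1\n")

    # A freshly compiled file always matches the interpreter that wrote it.
    pyc = py_compile.compile(str(src), cfile=str(tmp / "spam.pyc"))
    fresh_ok = pyc_matches_interpreter(pyc)

    # Simulate a .pyc from another bytecode version by swapping the header
    # for a value that no interpreter uses.
    stale = tmp / "stale.pyc"
    stale.write_bytes(b"\x00\x00\r\n" + pathlib.Path(pyc).read_bytes()[4:])
    stale_ok = pyc_matches_interpreter(stale)

print(fresh_ok, stale_ok)
```

A check like this could run when a pyc-only wheel is installed, so that a mismatch is reported up front and not as a failed import later.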

The first one: cp310-none-any as there are no ABI or platform restrictions, but you are restricted to a specific interpreter version. And that tag triple comes before any other *-none-any wheels in terms of priority.

antocuni · August 31, 2022, 9:34pm

Sorry, the notification email was buried below many others and I saw this thread only now. Let me give you my 2 cents even if I’m late in the conversation.
Being able to download stripped-down versions of the stdlib would always be very useful, but I agree that finding the right balance of size vs completeness is tricky.
For PyScript, I can imagine that for some use cases we can introduce a way to record/compute the set of modules which are actually used for a given page/app and build a custom bundle which contains only those, but there are other use cases in which you want to have “everything” available, e.g. notebook-like and interactive applications.

That said, let me comment on some of the topics:

Yes, this sounds like a very good idea to me. Even better if we can provide an easy way to add platform-specific information on how to solve the issue, although overriding sys.excepthook might be enough.

If I understand correctly, the issue is that even if we start an async download, it won’t be completed until we yield control to the event loop of the main JS thread, which cannot happen until we complete the import. Is it correct?

A note about how source code is displayed inside tracebacks. Eventually, the translation from “filename + line number” into “line of text” happens by calling linecache.getline:

github.com/python/cpython/blob/29f1b0bb1ff73dcc28f0ca7e11794141b6de58c9/Lib/traceback.py#L317-L323

          @property
          def line(self):
              if self._line is None:
                  if self.lineno is None:
                      return None
                  self._line = linecache.getline(self.filename, self.lineno)
              return self._line.strip()

So, it is in theory possible to display full tracebacks without having to download the whole files, but only the few lines which are actually needed, by teaching linecache.getline how to fetch them.
AFAIK this approach is used by the good old pylib’s py.code.Source to provide readable tracebacks of code which is generated at runtime and exec()ed.
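A minimal sketch of the idea, assuming the source text can be obtained from somewhere other than the local file system. Here fetch_source and the _REMOTE_SOURCES dict are hypothetical stand-ins for a network fetch, and the whole file is registered at once for simplicity. It relies on the fact that linecache keeps its entries in linecache.cache and does not re-validate entries whose mtime is None.

```python
import linecache
import traceback

# Hypothetical stand-in for sources that live elsewhere (e.g. on a server).
_REMOTE_SOURCES = {
    "/virtual/demo.py": "def boom():\n    raise ValueError('no source on disk')\n",
}


def fetch_source(filename):
    """Pretend to fetch source text remotely; returns None if unknown."""
    return _REMOTE_SOURCES.get(filename)


def register_source(filename):
    """Seed linecache so tracebacks can show lines for a file not on disk."""
    source = fetch_source(filename)
    if source is None:
        return False
    lines = source.splitlines(keepends=True)
    # Entry layout: (size, mtime, lines, fullname). mtime=None marks the
    # entry as not backed by a file, so checkcache() leaves it alone.
    linecache.cache[filename] = (len(source), None, lines, filename)
    return True


# Compile code whose file name points at a path that does not exist locally.
code = compile(_REMOTE_SOURCES["/virtual/demo.py"], "/virtual/demo.py", "exec")
namespace = {}
exec(code, namespace)

register_source("/virtual/demo.py")
try:
    namespace["boom"]()
except ValueError:
    formatted = traceback.format_exc()

print(formatted)
```

In a browser the fetch would have to be async, so in practice the cache would need to be filled before the traceback is formatted.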
To be really advanced, I can imagine the following:

pre-compute a table which given a filename and a line number, gives you the position inside the file in bytes
distribute that table (which is probably very small) with cpython-wasm/pyodide/pyscript/whatever
fetch the desired lines on the fly directly from github, using HTTP range requests (assuming that github supports range requests).
If we are lucky, the browsers are smart enough to know how to cache those requests to avoid downloading the same bytes again and again
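The offset table from the steps above can be sketched as follows. The function names are hypothetical, the “remote file” is just an in-memory bytes object, and the range request is simulated by slicing; only the Range header value is built, nothing is actually fetched.

```python
def build_line_offsets(data: bytes) -> list[int]:
    """Byte offset at which each line starts (index 0 is line 1)."""
    offsets = [0]
    for index, byte in enumerate(data):
        if byte == 0x0A:  # b"\n"
            offsets.append(index + 1)
    return offsets


def line_byte_range(offsets: list[int], lineno: int, total: int) -> tuple[int, int]:
    """Half-open byte range [start, end) covering the 1-based line number."""
    start = offsets[lineno - 1]
    end = offsets[lineno] if lineno < len(offsets) else total
    return start, end


source = b"import os\n\ndef f():\n    return os.getcwd()\n"
offsets = build_line_offsets(source)

start, end = line_byte_range(offsets, 4, len(source))
# HTTP Range headers use inclusive end positions.
header = f"bytes={start}-{end - 1}"
# Stand-in for the response body of the range request.
line = source[start:end].decode()

print(offsets, header, repr(line))
```

The table holds one integer per line, so it stays small compared with the source it indexes.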

Something which has been floating in my head for a while but I haven’t had time to play with is a WASM-specific binary format for pyc files. Currently, you have to download pyc files, write the bytes in the virtual filesystem, load the bytes from the FS and unmarshal them in order to build the in-memory data structures needed by python. I have a vague feeling it should be possible to “deep freeze” modules as described here and embed them into a WASM module which can be loaded directly in the browser – no unmarshaling needed – but I admit I never thought too much about it.

tiran · September 1, 2022, 6:18am

Deep freezing is a possible approach. It’s only practical for CPython stdlib modules, though. I haven’t tested how deepfreezing affects the output size. My gut feeling is that deepfreezing increases the size of the resulting binary.

Another approach is Wizer. It freezes pre-initialized memory into the WASM binary.

antocuni · September 1, 2022, 6:29am

Why only stdlib modules?
To be clearer, I’m not talking about deepfreezing modules inside the main WASM executable.
I’m talking about using the raw memory obtained by deepfreezing modules as a serialization format which can be “easily” and quickly loaded in WASM memory.

This would be relatively easy if WASM supported multiple memories, but that’s not the case yet.

It’s likely that in order to do that with the current capabilities we need to write some kind of relocation logic, because we cannot know the final memory addresses where these structs will reside at “compilation/deepfreezing time”; they will only be known at import time.
In principle it sounds doable; how complex it is and whether it’s worth it, I don’t know :).

Yes, this is exactly the sort of magic which I was thinking about. Glad to see that it already exists.