Move umarshal.py to the standard library

The CPython tree currently has a pure-Python pyc reader in Tools/scripts/umarshal.py. This would be useful as part of the standard library (possibly inlined in dis.py if we don’t want multiple modules dealing with bytecode).

My primary reason for proposing this is that pytype (which performs static analysis of Python bytecode) depends on it, but I think more tools would make use of bytecode analysis if the building blocks for it were better supported. PyPy, for example, also includes a marshal reader, which it would not need if there were one in the stdlib.

I foresee two objections to this: that the marshal/pyc format is an internal implementation detail that is version-dependent and needs to be free to change, and that a library like this could be its own package rather than being in the stdlib. But it is precisely because the format is version-dependent that I feel it should go into the stdlib, where it can evolve in parallel with the interpreter. As for being an internal implementation detail, the pyc file is still the most convenient place where tools can get at the compiled bytecode and effectively stand on the shoulders of all the work that has gone into the compiler. At a higher level, I feel strongly that the Python ecosystem in general should consider the bytecode a “first-class” (albeit version-dependent) artefact that is useful for tooling, and I would love to work towards making that happen.

Anecdotally, while pytype’s symbolic execution approach is currently unusual among Python typecheckers, it has worked out well for us, and is at the least a useful approach to source tooling. Another good example from a different language is js_of_ocaml, which compiles OCaml bytecode to JavaScript rather than working with the source code, and which therefore benefits from all the work that the compiler puts in.

Hi Martin (long time no see :-).

I have a different objection. When you’re using the Python version that wrote the code object, you don’t need umarshal.py – you can just use the marshal module, which does the same thing but faster, written in C, and it creates real code objects.
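
As a rough illustration of the same-version case, here is a minimal sketch that compiles a small source file and reads the resulting .pyc back with the marshal module. The helper name `load_code_from_pyc` is hypothetical, and the 16-byte header size is an assumption that holds for CPython 3.7 and later (PEP 552).

```python
import importlib.util
import marshal
import os
import py_compile
import tempfile
import types


def load_code_from_pyc(pyc_path):
    """Return the code object stored in a .pyc written by this interpreter.

    Hypothetical helper; assumes the 16-byte header used since CPython 3.7.
    """
    with open(pyc_path, "rb") as f:
        data = f.read()
    # The first four bytes are the bytecode magic number of the writer.
    if data[:4] != importlib.util.MAGIC_NUMBER:
        raise ValueError("pyc was written by a different Python version")
    # Skip magic (4 bytes), flags (4 bytes) and mtime/size or hash (8 bytes).
    return marshal.loads(data[16:])


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example.py")
    with open(src, "w") as f:
        f.write("x = 1 + 2\n")
    pyc = py_compile.compile(src, cfile=os.path.join(tmp, "example.pyc"))
    code = load_code_from_pyc(pyc)

# The result is a real code object that can be executed or passed to dis.
print(type(code) is types.CodeType)
```

The magic-number check is what stops this approach from working across versions: a .pyc written by a different interpreter fails that comparison before marshal ever sees the payload.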

The umarshal.py file in Tools/build (not Tools/scripts!) is useful because it can be executed by a different version of Python. Suppose you’re writing a utility that can be run using e.g. Python 3.9 but must analyze .pyc files written by Python 3.12. You can’t use the 3.9 marshal module – it would crash because the code object format has changed a lot between 3.9 and 3.12. There is no way you can import the 3.12 marshal module in your 3.9 program either. But you can import the umarshal module.

If we added umarshal.py to the 3.12 stdlib, that wouldn’t help your 3.9 program, since it’s still not in the 3.9 stdlib. Adding the directory containing the 3.12 stdlib to the 3.9 sys.path would just cause endless confusion, because it would try to import other things into 3.9 that were written for 3.12.

So your best bet, even if it was installed in the 3.12 stdlib, would be to copy umarshal.py into your own code base.

Hi Guido!

That’s a good point: I would indeed not have the right umarshal version available in the host Python’s stdlib. It would possibly make more sense to have an umbrella umarshal module with a collection of readers for different Python versions. While I would like to see that in the stdlib anyway, I realise there is a lot less precedent for maintaining stdlib code that explicitly deals with older versions of Python. I’ll experiment with creating a umarshal package on PyPI first.
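
To make the umbrella idea a little more concrete, here is one possible shape for such a package, sketched under several assumptions: readers are looked up by the 4-byte magic number at the start of the .pyc, every supported version uses the 16-byte header (true for CPython 3.7 and later), and the names `register_reader` and `load_pyc_payload` are hypothetical. Only the running interpreter is registered here, using the stdlib marshal module; a real package would register a pure-Python reader per supported version.

```python
import importlib.util
import marshal

# Maps a 4-byte pyc magic number to a callable that decodes the payload.
_READERS = {}


def register_reader(magic, loads):
    """Associate a pyc magic number with a payload decoder (hypothetical API)."""
    _READERS[magic] = loads


def load_pyc_payload(data):
    """Decode the marshalled payload of a .pyc using the matching reader."""
    magic = data[:4]
    try:
        loads = _READERS[magic]
    except KeyError:
        raise ValueError(f"no reader registered for magic {magic!r}") from None
    # Assumes the 16-byte header layout used since CPython 3.7.
    return loads(data[16:])


# For the interpreter running this code, the C marshal module is the reader.
register_reader(importlib.util.MAGIC_NUMBER, marshal.loads)

# Build an in-memory pyc image: magic, 12 bytes of zeroed header, payload.
payload = marshal.dumps(compile("y = 2", "<sketch>", "exec"))
image = importlib.util.MAGIC_NUMBER + bytes(12) + payload
code = load_pyc_payload(image)
```

Keying on the magic number rather than on a version string means the caller never has to know which Python wrote the file.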

Yeah, that definitely sounds like something that could live on PyPI. I’d be happy to contribute all the versions of umarshal.py that you could find in the CPython history (though I think the oldest version is for 3.11).

Can the marshal format change between bugfix releases? Like from 3.11.0 to 3.11.1?

FWIW, we maintain marshalparser for Fedora, where we sometimes need to postprocess pyc files (to make them bit-for-bit equivalent with different builds of Python).
I wouldn’t object to expanding it to handle other use cases, if that would help. (It may be easier to start your own project, but it then needs to be kept up to date for each new release. We’ll at least verify there are no new changes, for each release.)

Yes.
The basic format doesn’t change often, but code objects are version-specific and marshal.version doesn’t change when their fields are added/removed. So for .pyc, the bytecode magic number effectively also versions the marshal format.
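
For anyone who wants to look at the two version markers side by side, here is a small sketch. It assumes the CPython layout of the magic number (a 16-bit little-endian integer followed by `\r\n`); the function name `describe_pyc_magic` is hypothetical.

```python
import importlib.util
import marshal


def describe_pyc_magic(header):
    """Decode the 16-bit magic value from the first four bytes of a .pyc.

    Hypothetical helper; assumes CPython's <uint16 little-endian> + b"\\r\\n".
    """
    if len(header) < 4 or header[2:4] != b"\r\n":
        raise ValueError("not a CPython .pyc header")
    return int.from_bytes(header[:2], "little")


running_magic = describe_pyc_magic(importlib.util.MAGIC_NUMBER)

# marshal.version covers the basic serialization format, while the magic
# number changes whenever the bytecode or code object layout changes.
print("marshal.version:", marshal.version)
print("bytecode magic:", running_magic)
```

A tool reading .pyc files from arbitrary interpreters would therefore branch on the magic value, not on `marshal.version`.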
