The CPython tree currently has a pure-Python pyc reader in Tools/scripts/umarshal.py. This would be useful as part of the standard library (possibly inlined in dis.py if we don’t want multiple modules dealing with bytecode).
My primary reason for proposing this is that pytype (which performs static analysis of Python bytecode) depends on it, but I feel that more tools would make use of bytecode analysis if the building blocks for it were better supported. In this particular case, PyPy also includes a marshal reader, which it would not need if there were one in the stdlib.
I foresee two objections to this: that the marshal/pyc format is a version-dependent internal implementation detail that needs to be free to change, and that a library like this could be its own package rather than being in the stdlib. But it is precisely because the format is version-dependent that I feel it should go into the stdlib, where it can evolve in parallel with the interpreter. As for being an internal implementation detail, the pyc file is still the most convenient place where tools can get at the compiled bytecode and effectively stand on the shoulders of all the work that has gone into the compiler. At a higher level, I feel strongly that the Python ecosystem in general should consider the bytecode a “first-class” (albeit version-dependent) artefact that is useful for tooling, and I would love to work towards making that happen.
Hi Martin (long time no see :-).
I have a different objection. When you’re using the Python version that wrote the code object, you don’t need umarshal.py – you can just use the marshal module, which does the same thing but faster, written in C, and it creates real code objects.
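As a rough sketch of that same-version case: the snippet below compiles a throwaway source file, skips the .pyc header, and hands the rest to marshal.loads. It assumes the 16-byte header used since Python 3.7 (PEP 552); the helper name load_pyc and the file names are made up for illustration.

```python
import marshal
import py_compile
import tempfile
import types
from pathlib import Path


def load_pyc(path):
    """Return the code object from a .pyc written by the running interpreter.

    Hypothetical helper for illustration only.
    """
    data = Path(path).read_bytes()
    # Assumption: Python 3.7+ header, i.e. 4 bytes magic, 4 bytes flags,
    # then 8 bytes of mtime + source size or a source hash.
    return marshal.loads(data[16:])


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "example.py"
    src.write_text("answer = 6 * 7\n")
    # py_compile.compile returns the path of the .pyc it wrote.
    pyc = py_compile.compile(str(src), cfile=str(Path(tmp) / "example.pyc"))
    code = load_pyc(pyc)

# The result is a real code object that can be executed or passed to dis.
namespace = {}
exec(code, namespace)
print(type(code).__name__, namespace["answer"])
```

This only works when the .pyc was written by the same Python version that runs the snippet.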
The umarshal.py file in Tools/build (not Tools/script!) is useful because it can be executed by a different version of Python. Suppose you’re writing a utility that can be run using e.g. Python 3.9 but must analyze .pyc files written by Python 3.12. You can’t use the 3.9 marshal module – it would crash because the code object format has changed a lot between 3.9 and 3.12. There is no way you can import the 3.12 marshal module in your 3.9 program either. But you can import the umarshal module.
If we added umarshal.py to the 3.12 stdlib, that wouldn’t help your 3.9 program, since it’s still not in the 3.9 stdlib. Adding the directory containing the 3.12 stdlib to the 3.9 sys.path would just cause endless confusion, because it would try to import other things into 3.9 that were written for 3.12.
So your best bet, even if it was installed in the 3.12 stdlib, would be to copy umarshal.py into your own code base.
That’s a good point; I would indeed not have the right umarshal version available in the host Python stdlib. It would possibly make more sense to have an umbrella umarshal module with a collection of readers for different Python versions. While I would like to see that in the stdlib anyway, I realise there is a lot less precedent for maintaining stdlib code that explicitly deals with older versions of Python. I’ll experiment with creating a umarshal package on PyPI first.
Yeah, that definitely sounds like something that could live on PyPI. I’d be happy to contribute all the versions of umarshal.py that you could find in the cpython history (though I think the oldest version is for 3.11).
Can the marshal format change with releases? Like from 3.11.0 to 3.11.1?
FWIW, we maintain marshalparser for Fedora, where we sometimes need to postprocess pyc files (to make them bit-for-bit equivalent with different builds of Python).
I wouldn’t object to expanding it to handle other use cases, if that would help. (It may be easier to start your own project, but it then needs to be kept up to date for each new release. We’ll at least verify there are no new changes, for each release.)
The basic format doesn’t change often, but code objects are version-specific and marshal.version doesn’t change when their fields are added/removed. So for .pyc, the bytecode magic number effectively also versions the marshal format.
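To illustrate, a tool could compare the first four bytes of a .pyc against importlib.util.MAGIC_NUMBER before deciding whether the built-in marshal module is safe to use. This is only a sketch; the helper name written_by_this_python is invented here, and the "foreign" file is simulated by corrupting the magic rather than produced by a real second interpreter.

```python
import importlib.util
import marshal
import py_compile
import tempfile
from pathlib import Path


def written_by_this_python(pyc_path):
    """True if the .pyc magic number matches the running interpreter's.

    Hypothetical helper for illustration only.
    """
    with open(pyc_path, "rb") as f:
        magic = f.read(4)
    return magic == importlib.util.MAGIC_NUMBER


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "example.py"
    src.write_text("x = 1\n")
    own_pyc = py_compile.compile(str(src), cfile=str(Path(tmp) / "own.pyc"))
    own_ok = written_by_this_python(own_pyc)

    # Simulate a file from another version by altering the first magic byte.
    data = bytearray(Path(own_pyc).read_bytes())
    data[0] ^= 0xFF
    foreign = Path(tmp) / "foreign.pyc"
    foreign.write_bytes(bytes(data))
    foreign_ok = written_by_this_python(foreign)

# marshal.version describes the basic format only, so it is the magic
# number check above that tells the two files apart.
print(marshal.version, own_ok, foreign_ok)
```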