What information is useful to know statically about an interpreter?

indygreg · April 9, 2023, 2:06am

As part of building PyOxidizer I had to gradually add metadata to python-build-standalone’s PYTHON.json files (Distribution Archives — python-build-standalone documentation) to support functionality. You can see the evolution by searching the linked docs for “or above only” to see features introduced in subsequent versions. Or just download the .tar.zst archives from Release 20230116 · indygreg/python-build-standalone · GitHub and look at them for examples.

In newer versions of PyOxidizer, I support some forms of cross-compiling. This means that (from Rust) we need to glean information about the Python distribution/interpreter that will run on the target machine. That interpreter cannot be run on the current machine, so you can’t evaluate Python code to discover metadata. Hence why all of this metadata is captured in a standalone JSON file.

Here are some examples of metadata I needed to add and why:

Python platform tags so you can find compatible wheels. All the metadata so you can invoke pip download with --platform, --python-version, --implementation, --abi, etc to find wheels that are compatible with this distribution.
sys.implementation.cache_tag (and other bytecode related properties such as the magic number) so you can create bytecode files for foreign platforms.
sysconfig installation paths so you can discover Python modules, bytecode, other files in the distribution.
Some sysconfig.get_config_vars() entries so you can build extension modules or link a custom libpython. Notably, PyO3 needs Py_DEBUG, Py_TRACE_REFS, and some other flags so its generated C API bindings have the appropriate struct layouts.
importlib.machinery.*_SUFFIXES values so you can categorize files in the distribution. Also allows you to categorize files in any wheels you may encounter (in case you want to reinvent wheel installing without running Python/pip).
Apple SDK metadata (name, platform, version, deployment target) so you can attempt to use a compatible Apple SDK when building Mach-O binaries with identical targeting requirements.
Licensing metadata of components. This allows PyOxidizer to automatically strip copyleft components and display a licensing report to help you conform to licensing requirements when (re)distributing software.
tcl/tk resource file path so you can find these support files.
List of stdlib packages related to tests so they can be deleted to not waste space.
Path(s) to libpython so you can easily copy it.
Metadata defining _PyImport_Inittab and how to compile it in case you want to provide your own set of built-in extension modules. This includes metadata about each extension module, including the name of its initialization function.
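
To make a few of these items concrete, here is a minimal sketch of how the same values could be snapshotted when you can run the interpreter, which is exactly what isn't possible in the cross-compiling case. It only uses the stdlib APIs named above plus importlib.util.MAGIC_NUMBER for the bytecode magic number. The function name, the dictionary keys, and the choice of config vars are my own for illustration; they are not the PYTHON.json schema.

```python
import importlib.machinery
import importlib.util
import json
import sys
import sysconfig


def snapshot_interpreter() -> dict:
    """Collect values a cross-compiling tool would otherwise need a file for.

    The key names are hypothetical and do NOT mirror PYTHON.json.
    """
    config_vars = sysconfig.get_config_vars()
    # Illustrative selection only; a real tool would likely record more.
    wanted_vars = ("Py_DEBUG", "Py_TRACE_REFS", "SOABI", "EXT_SUFFIX")
    return {
        "implementation_name": sys.implementation.name,
        # Used to name bytecode files, e.g. foo.<cache_tag>.pyc
        "cache_tag": sys.implementation.cache_tag,
        # First bytes of every .pyc file, stored as hex so it is JSON-safe
        "bytecode_magic_number": importlib.util.MAGIC_NUMBER.hex(),
        # Installation paths (stdlib, purelib, platlib, scripts, ...)
        "paths": sysconfig.get_paths(),
        # Missing variables are recorded as None rather than raising
        "config_vars": {name: config_vars.get(name) for name in wanted_vars},
        # Suffix lists used to categorize files as source/bytecode/extension
        "suffixes": {
            "source": list(importlib.machinery.SOURCE_SUFFIXES),
            "bytecode": list(importlib.machinery.BYTECODE_SUFFIXES),
            "extension": list(importlib.machinery.EXTENSION_SUFFIXES),
        },
    }


if __name__ == "__main__":
    print(json.dumps(snapshot_interpreter(), indent=2, sort_keys=True))
```

Running this with the target interpreter on a machine that can execute it, and shipping the output next to the distribution, is the general idea behind having a static metadata file.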

I think a good litmus test for whether the set of metadata is sufficient is whether it satisfies these scenarios:

a) I can reimplement pip without using Python (finding compatible wheels, building extensions, installing files at the appropriate locations in the filesystem, finding installed packages, etc).
b) Given just the path to a Python interpreter and its metadata, I can load its libpython into the current process and initialize and run a working Python interpreter using the C API. (Assume you have awareness of the C API for all versions of Python and can dynamically generate ABI compatible bindings appropriate for the interpreter being used.)
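
As a rough illustration of the "finding compatible wheels" part of scenario (a), here is a sketch of matching a wheel filename against a set of supported tags taken from static metadata. It follows the standard wheel filename layout (name-version[-build]-python-abi-platform.whl, with "."-separated compressed tag sets). The SUPPORTED set and the function names are hypothetical; a real tool would derive the tag list from the distribution's metadata and would also care about tag priority order, which this skips.

```python
from itertools import product


def wheel_tags(filename: str) -> set:
    """Expand a wheel filename into its set of (python, abi, platform) tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # name-version[-build]-python-abi-platform: the last three fields are tags.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    # Each field may hold several "."-separated values (compressed tag sets),
    # so the wheel supports every combination of them.
    return set(
        product(python_tag.split("."), abi_tag.split("."), platform_tag.split("."))
    )


def is_compatible(filename: str, supported: set) -> bool:
    """True if the wheel shares at least one tag triple with the interpreter."""
    return not wheel_tags(filename).isdisjoint(supported)


# Hypothetical tags for a CPython 3.10 Linux x86_64 distribution.
SUPPORTED = {
    ("cp310", "cp310", "manylinux2014_x86_64"),
    ("cp310", "abi3", "manylinux2014_x86_64"),
    ("py3", "none", "any"),
}

if __name__ == "__main__":
    for name in (
        "example-1.0-py3-none-any.whl",
        "example-1.0-cp311-cp311-win_amd64.whl",
    ):
        print(name, is_compatible(name, SUPPORTED))
```

Nothing here needs to execute the target interpreter, which is the property the litmus test is after.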