What information is useful to know statically about an interpreter?

This is somewhat related, but also somewhat tangential, to PEP 711: PyBI: a standard format for distributing Python Binaries. Over there, @njs is suggesting recording what is necessary to resolve dependencies for an interpreter (previously discussed in What information is needed to choose the right dependency (file) for a platform?). From the perspective of VS Code and the Python Launcher for Unix, I’m also interested in other details that can be recorded statically to help people understand what interpreter they are selecting and how to use it (i.e., what to show users in VS Code and how to execute their code).

I have been asked a few times to bring forward my thoughts on this here as I have a sketch of the details the Python Launcher for Unix and VS Code would be interested in at Support a way for other tools to assist in environment/interpreter discovery · brettcannon/python-launcher · Discussion #168 · GitHub . With PEP 711 probably about to spark some conversation, I figured now might be a good time to discuss this and see if there’s any chance of getting a unified set of data that we could record about an interpreter so we can get some interoperability around it for tools (both producers and consumers of it).

For ease of reading, my proposed interpreter details are outlined below:

[
    {
        // Returning an array forces the details to be fully self-contained. This facilitates any
        // library that may return a collection of these results by not requiring any post-processing
        // on what a locator returns; chaining results from multiple sources is all that is required.

        // A unique identifier for the interpreter.
        // The key should be unambiguous but as weakly specified as possible, i.e. pointing to the
        // directory for environments, but to the specific interpreter for something found on `PATH`
        // (i.e. `python3.11`, not just `python3`). That way tools can generally agree on the same key
        // w/o coordination. (Open question whether this should be the most specific path instead.)
        "path_id": "/home/brett/my-venvs/my-venv",
        "python_version": {
            // `sys.version_info`
            "major": 3, // Optional
            "minor": 10, // Optional
            "micro": 1, // Optional
            "releaselevel": "final", // Optional
            "serial": 0 // Optional
        },
        "implementation": { // Optional
            // `sys.implementation`
            "name": "cpython",
            "version": { // Has the same structure as `python_version` above.
                "major": 3, // Optional
                "minor": 10, // Optional
                "micro": 1, // Optional
                "releaselevel": "final", // Optional
                "serial": 0 // Optional
            }
        },
        "executable": {
            // An array specifying what is required to execute the interpreter.
            // The expectation is to append args and code to the end of the array before
            // execution.
            // E.g. for conda environments:
            // ```
            // ["/path/to/conda", "run", "--path",
            //  "/home/brett/.conda/envs/conda-env", "--no-capture-output"]
            // ```
            "run": ["/home/brett/my-venvs/my-venv/bin/python"],
            "bits": 64, // Optional
            "architecture": "x86-64"  // Optional
        },
        "environment": { // Optional
            // What type of environment, e.g. "virtual", "conda", etc.
            "type": "virtual",
            "name": "my-venv"  // Assume the directory name if no specific name.
        },
        // Is the result specific to the workspace?
        "context-sensitive": true,
        // Who created this result.
        "locator_name": "Python Launcher"  // Optional (?)
    }
]

All the data related to the interpreter should be retrievable by running some code. Everything is optional so that as much data as possible can be provided quickly, then potentially filled in later if needed (e.g., the Python Launcher technically just cares about `executable.run`, while the version details simply help it make better decisions).
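As an illustrative sketch (the names here are hypothetical, not part of any standard), a consumer could fill in missing details by actually running the interpreter, launching it via the `executable.run` command array and appending the code to run:

```python
import json
import subprocess

# Small probe program printing the details in the proposed shape.
PROBE = (
    "import json, sys; print(json.dumps({"
    "'python_version': dict(zip(('major', 'minor', 'micro', 'releaselevel', 'serial'), sys.version_info)),"
    " 'implementation': {'name': sys.implementation.name}}))"
)

def query_interpreter(run_command):
    """Append code to the end of the `run` array, per the proposal, and run it."""
    result = subprocess.run(
        [*run_command, "-c", PROBE], capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)
```

For a plain interpreter `run_command` is just the path to the binary; for a conda environment it would be the longer `conda run` invocation shown in the comment above.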

Adding in packaging details wouldn’t be hard and could make sense, but since I started from the perspective of execution and display details I didn’t worry about that (yet). Plus I figured it could be added later as more optional data.

I will say that if this data ever ships with every Python interpreter and can simply be read from a file, that would be amazing for my use cases.


What about the OS?

Since it’s running on the same machine, I had not worried about it. If we record all of the packaging-related details (i.e. wheel tags and markers), then it would be implicitly captured by that.
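As a stdlib-only illustration of that point, the platform portion of a wheel tag is derived from `sysconfig.get_platform()`, so recording an interpreter's tags implicitly records its OS and CPU architecture as well:

```python
import sysconfig

# The normalization below (dashes and dots to underscores) is what turns the
# sysconfig platform string into the platform part of a wheel tag.
platform_tag = sysconfig.get_platform().replace("-", "_").replace(".", "_")
print(platform_tag)  # e.g. "linux_x86_64", "win_amd64", "macosx_13_0_arm64"
```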

Native int size, 32 or 64 bits.


Given that the interpreters you want information about are running on the same system you are, why do you want static metadata (as opposed to querying the interpreter itself by invoking it)? Just speed? Is there an obstacle to caching it?

What do you want to know that’s not in the PEP 711 draft? E.g. you can definitely reconstruct the architecture, implementation version, Python version…

You are suggesting that I can use win32 vs. win_amd64 as the way to know if it’s 32-bit vs. 64-bit.

On macOS, 32-bit is gone; I have to work out ARM vs. Intel from the tags.

Are there any 32-bit Linux platforms that still matter and will need support?
Fedora still has some 32-bit RPMs for KVM, I think, but is otherwise 64-bit.

As part of building PyOxidizer I had to gradually add metadata to python-build-standalone’s PYTHON.json files (Distribution Archives — python-build-standalone documentation) to support functionality. You can see the evolution by searching the linked docs for the phrase “or above only” to see features introduced in subsequent versions. Or just download the .tar.zst archives from Release 20230116 · indygreg/python-build-standalone · GitHub and look at them for examples.

In newer versions of PyOxidizer, I support some forms of cross-compiling. This means that (from Rust) we need to glean information about the Python distribution/interpreter that will run on the target machine. That interpreter cannot be run on the current machine, so you can’t evaluate Python code to discover metadata. Hence why all of this metadata is captured in a standalone JSON file.

Here’s some examples of metadata I needed to add and why:

  • Python platform tags so you can find compatible wheels. All the metadata so you can invoke pip download with --platform, --python-version, --implementation, --abi, etc to find wheels that are compatible with this distribution.
  • sys.implementation.cache_tag (and other bytecode related properties such as the magic number) so you can create bytecode files for foreign platforms.
  • sysconfig installation paths so you can discover Python modules, bytecode, other files in the distribution.
  • Some sysconfig.get_config_vars() entries so you can build extension modules, link a custom libpython, etc. Notably, PyO3 needs Py_DEBUG, Py_TRACE_REFS, and some other flags so its generated C API bindings have the appropriate struct layouts.
  • importlib.machinery.*_SUFFIXES values so you can categorize files in the distribution. Also allows you to categorize files in any wheels you may encounter (in case you want to reinvent wheel installing without running Python/pip).
  • Apple SDK metadata (name, platform, version, deployment target) so you can attempt to use a compatible Apple SDK when building Mach-O binaries with identical targeting requirements.
  • Licensing metadata of components. This allows PyOxidizer to automatically strip copyleft components and display a licensing report to help you conform to licensing requirements when (re)distributing software.
  • tcl/tk resource file path so you can find these support files.
  • List of stdlib packages related to tests so they can be deleted to not waste space.
  • Path(s) to libpython so you can easily copy it.
  • Metadata defining _PyImport_Inittab and how to compile it in case you want to provide your own set of built-in extension modules. This includes metadata about each extension module, including the name of its initialization function.
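As a rough sketch of the `importlib.machinery.*_SUFFIXES` use case above, here is how a tool might categorize files in a distribution or wheel. Note this uses the *current* interpreter's values for simplicity, whereas python-build-standalone records them statically per distribution:

```python
import importlib.machinery as machinery

def categorize(filename: str) -> str:
    # Check extension-module suffixes first; they are the most specific
    # (e.g. ".cpython-310-x86_64-linux-gnu.so", not just ".so").
    if any(filename.endswith(s) for s in machinery.EXTENSION_SUFFIXES):
        return "extension"
    if any(filename.endswith(s) for s in machinery.SOURCE_SUFFIXES):
        return "source"
    if any(filename.endswith(s) for s in machinery.BYTECODE_SUFFIXES):
        return "bytecode"
    return "other"
```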

I think a good litmus test for whether the set of metadata is sufficient is whether it satisfies these scenarios:

a) I can reimplement pip without using Python (finding compatible wheels, building extensions, installing files at the appropriate locations in the filesystem, finding installed packages, etc.).
b) Given just the path to a Python interpreter and its metadata, I can load its libpython into the current process and initialize and run a working Python interpreter using the C API. (Assume you have awareness of the C API for all versions of Python and can dynamically generate ABI compatible bindings appropriate for the interpreter being used.)


For maturin and monotrail:

  • python version (major/minor), python implementation, os (and libc), architecture; effectively wheel tags
  • PEP 508 metadata
  • Paths to the interpreter and the shared library

A JSON or TOML file with this information would be extremely useful! This is a lot faster and less error-prone than launching a Python interpreter, and there are some cases, such as cross-compiling, where you don’t want to run Python at all (PyO3 used to parse header files for specific cases).

Otherwise, what @indygreg wrote; those points are a subset of his.


tbh I was mostly responding to Brett b/c we were already talking in the PEP 711 thread :-). But yeah, you can do that – does it work for your use case? I’m not sure what you’re doing with this information.

32-bit ARM is still in use, e.g. on Raspberry Pis.

Oh, this is fantastic. Did you see the PEP 711 thread? Do you want to team up on anything there? I don’t think PyBIs can totally replace the python-build-standalone distributions – in particular the build artifacts for people to re-link – but I think it would be awesome if PyOxidizer could handle at least some situations by consuming official Python releases from PyPI.

Huh, I didn’t know it was even possible to compile bytecode for foreign platforms. How does that work?

I guess in principle the PEP 425 tags inside the PyBI metadata have everything you need to determine struct layouts, since by definition two Pythons with incompatible ABIs have to have different tags. Is that practical, or would it be too onerous to compute that way and it’s better to store the info directly?

I don’t understand what these are used for… isn’t the extension for python files always going to be .py? And my code for installing wheels runs without Python/pip, and doesn’t use these, so I don’t follow that part either.

Huh, that makes a lot of sense actually. makes a note

Good point. We do have PEP 639 to encode SPDX tags and license text (though I guess it’s still a draft? @brettcannon is this still cooking?). But it doesn’t have any way to associate licenses with specific paths inside the archive. Maybe it should? I guess most projects wouldn’t bother, but it seems generally useful to allow the option, and for specific projects like CPython it probably would be worth the effort.

I have no idea what tcl/tk resources are :slight_smile: what makes them special?

Is this not just {stdlib}/test?

I’m not 100% sure I understand what this is, but it sounds like it’s only relevant when working from the intermediate build artifacts – is that right?

As far as these go, I think the PEP 711 draft already satisfies them, except for missing the path to libpython. But you need more for cross-compiling, subsetting (ie stripping GPL components or tests), and… whatever the use cases you’re supporting with tcl/tk metadata and SUFFIXES :slight_smile:

Yes. I have a reply drafted and figured I’d start posting to this thread first. There might be room for us to collaborate here. But I’m a bit stretched for time, so I can’t promise anything. I can commit to one-off meetings or reviews. But open-ended time for coding is precious for me at the moment.

TBH the relinking is more complexity than is healthy for most users. Some people do want single file binaries. But for the common case I imagine PyOxidizer evolving to support a static, PyBI-like distribution where all PyOxidizer does is collect dependencies (often via invoking pip) and emit an application / driver binary with the Rust main().

I don’t have a strong opinion. You should ask the PyO3 maintainers.

I’m not yet doing too much of this. Just like Linux → Linux cross-compilation. But in theory I believe marshaling is bit-identical for the same CPython version regardless of platform. I might be wrong about that though.
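For context on why this works: a CPython 3.7+ `.pyc` file is just a 16-byte header (magic number, flags, source mtime, source size) followed by a marshaled code object, and none of those fields depend on the platform. A simplified sketch (real tools should use `py_compile`/`compileall`; this only shows the layout):

```python
import importlib.util
import marshal
import struct

def make_pyc(source: str, filename: str) -> bytes:
    code = compile(source, filename, "exec")
    # Zeroed flags/mtime for simplicity; the magic number is the only field
    # that ties the bytecode to a CPython version.
    header = importlib.util.MAGIC_NUMBER + struct.pack(
        "<III", 0, 0, len(source.encode())
    )
    return header + marshal.dumps(code)
```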

Yeah, Python source and bytecode is easy and can be hardcoded. It is the extension modules where things get wonky. You could hardcode heuristics here as well. But out of principle the naming scheme is parameterized at interpreter build time and people could customize it. So IMO it is best to just read the metadata from the interpreter so there’s no potential for disagreement.
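A minimal sketch of “read the metadata from the interpreter” for extension-module naming: `EXT_SUFFIX` is baked in at interpreter build time, so querying it avoids hardcoding heuristics like `.cpython-310-x86_64-linux-gnu.so`:

```python
import sysconfig

# EXT_SUFFIX is the full, build-time-parameterized suffix for extension
# modules on this interpreter.
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")

def extension_filename(module_name: str) -> str:
    return module_name + ext_suffix
```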

While we agree that we don’t want to expose object data and other metadata to facilitate relinking libpython, licensing metadata does require annotating each extension module’s license, its library dependencies, and their licensing metadata. This is probably best done by annotating paths: you don’t want consumers to have to use heuristics to figure out the path to a .so/.dll, because library names are highly platform-specific, i.e. this is avoidable complexity for consumers.

The tkinter module requires various .tcl and other support files to work. On Linux, some of these are supplied by the system tcl/tk packages. On macOS framework distributions and Windows, the CPython distributions provide them themselves. python-build-standalone distributes them on all flavors. If you don’t support tkinter, you can ignore these support files.

If only it were that easy. There are other packages like bsddb.test, ctypes.test, email.test, unittest.test, and more. I currently manually annotate these in python-build-standalone. I wish I could get the annotation from sysconfig metadata.

Yes. Ignore this unless you want to enable relinking a custom libpython.

@konstin? (Question is whether PEP 425 ABI tags are good enough for pyo3 to figure out what struct layouts to use, or if you need something more.)

Yeah, file extensions for extension modules are complicated, that makes sense. But why do you need to be able to identify extension modules on the filesystem?

PyO3 and other build systems like nanobind, meson, and CMake create extension modules, so they need to know what to name the file they create, especially when cross-compiling.

See executable.bits.

Correct. Having to ask every interpreter and environment for these details can be costly (we have spent a huge amount of time in VS Code trying to improve the performance around getting this information because some people have hundreds of environments installed; I think over 700 is the highest reported in an issue that I can remember).

With very robust caching, probably not, but that “robust” bit is part of the difficulty. For instance, if you update your Python 3.11 interpreter from 3.11.0 to 3.11.2, then you can’t rely on the path or anything to tie back to the cached details. Hopefully mtime and file size? But it also just plain sucks on first execution to gather this information even if you do cache it (once again, speaking from experience about how impatient users are).
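An illustrative sketch of the invalidation problem described above: key the cache on (path, mtime, size) so an in-place upgrade, e.g. 3.11.0 to 3.11.2, no longer matches the stale entry (the function names here are hypothetical):

```python
import os

def cache_key(interpreter_path: str) -> str:
    # mtime and size both change when the binary is replaced in place.
    st = os.stat(interpreter_path)
    return f"{interpreter_path}:{st.st_mtime_ns}:{st.st_size}"

def load_cached_details(cache: dict, interpreter_path: str):
    # Returns None on a miss, meaning the interpreter must be re-queried.
    return cache.get(cache_key(interpreter_path))
```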

For instance, I didn’t see any implementation details in PEP 711. It’s totally fine if this metadata is kept separate (interpreter execution details vs. install/build details), but checking whether there’s a reasonable way for them to build on top of one another is the reason for posting this (plus, as I said, various people have asked me to post about this before).

Doh, of course! Thank you :slight_smile:

What implementation details do you need? I notice the name and version in your JSON above, and those are already included in the pybi metadata, via the environment markers dictionary (check out the full list).

The environment markers are surprisingly useful – eg I realized I didn’t need to add any new metadata to support Requires-Python, because it’s already there in the python_full_version marker.
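A sketch of that point, using the third-party `packaging` library (pip vendors a copy; you may need `pip install packaging`): a Requires-Python check is just a marker evaluation against the recorded `python_full_version`:

```python
from packaging.markers import Marker

# Pretend this came from static pybi-style metadata rather than a live
# interpreter (the dict shape here is illustrative).
recorded = {"python_full_version": "3.10.1"}

marker = Marker("python_full_version >= '3.8'")
print(marker.evaluate(recorded))  # True for the recorded 3.10.1
```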

I’m not really a pyo3 maintainer anymore (and haven’t touched the build config in quite a while), pinging @davidhewitt instead. fwiw the implementation (in python) is at pyo3/impl_.rs at b4d4904d71fe526b99bd649a66a059c43ebcb4d0 · PyO3/pyo3 · GitHub

For reference, PEP 514 defines the following static metadata for interpreters on Windows:

  • Install path (sys.prefix)
  • Executable path (typically, though not necessarily, sys.executable)
  • Executable arguments (args to pass first when launching the executable)
  • Windowed versions of the above two
  • Supported language version (sys.version_info[:3])
  • Runtime architecture (platform.architecture())
  • Display name and link for documentation (with support for extra types of documentation)
  • Display name for the distributor/supplier
  • Support URL for the distributor/supplier

That seems to have proven itself to be enough for simply launching Python, but it’s not enough for doing things like cross-compiling. We’d need another way to get sysconfig-level options statically.
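One possible shape for “sysconfig-level options statically” (the variable list below is just an example of what cross-compilers tend to need): a build could dump these to a JSON file next to the interpreter so tools never have to run it.

```python
import json
import sysconfig

# Example selection; a real standard would need to agree on the exact set.
WANTED = ["EXT_SUFFIX", "SOABI", "Py_DEBUG", "LIBDIR", "LDLIBRARY"]
snapshot = {name: sysconfig.get_config_var(name) for name in WANTED}
print(json.dumps(snapshot, indent=2))
```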

There are a couple of cases which I think aren’t covered by ABI tags:

  • Does the interpreter ship with a libpython? Where is it? This is necessary if building a Rust binary which dynamically links to libpython. (Also always necessary to link on Windows and Android.)
  • I think Python debug builds on Windows don’t use an ABI tag? We had to resort to checking if EXT_SUFFIX starts with _d as a heuristic.

When cross-compiling we can’t run the target interpreter to get all this stuff, so we have resorted to peeking in sysconfigdata files as there’s not a better option we’re aware of.
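For illustration, “peeking in sysconfigdata files” amounts to loading the *target’s* `_sysconfigdata_*.py` module directly instead of running that interpreter (the real file name and location vary by platform and build):

```python
import importlib.util
from pathlib import Path

def load_sysconfigdata(path: Path) -> dict:
    # The module defines a single dict, build_time_vars, holding the
    # sysconfig variables captured when the target interpreter was built.
    spec = importlib.util.spec_from_file_location("_target_sysconfigdata", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.build_time_vars
```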

Have you got an implementation of the kind of distribution you’d want us to support? We could try a branch of PyO3 using that metadata and see what works.

I meant sys.implementation, not a general “implementation details”. :sweat_smile:

Ah yes, the dual binary “fun”. :wink:

That isn’t available anywhere but Windows via the registry right now, although I would like to fix that someday so there’s something in sys for this.

Everything else is already in the proposal.

Depends on how far back you’re looking, but today there is no special ABI suffix for debug builds, and there hasn’t been for several releases.

I see. I wonder in that case if this windows-specific build logic to add _d to the names of debug binaries can be removed?

That’s a @steve.dower question. It might be different on Windows compared to Unix.