Re-use of standard library across implementations

That openembedded manifest is wonderful.

1 Like

Perhaps, but honestly how many people actually ever navigate a CPython source checkout? And even if you do, what is the motivation to browse by file instead of by documentation? And doing it by physical directory makes any changes later more difficult compared to documentation changes. Basically you’re working against 29 years of development habits to get this changed. :grin:

But you’re assuming that’s something we want to promote. :wink: We already have enough issues with Linux distros like Debian leaving out stuff like venv so I’m not sure if we would want to promote having people claim they support Python while missing significant parts of the stdlib (MicroPython/CircuitPython get away with this since their execution environment is so different from CPython’s). I mean maybe we would be okay with promoting it, but we have not had that discussion.

Basically you would propose a PEP outlining how you would want things to change. E.g. do you want a different folder structure? Do you want to break the stdlib out to its own repository?

Yep! Make sure to read PEP 1 and PEP 12.

This thread is also highly relevant:

1 Like

IMO if you do split the standard library into (n) wheels, and get Python applications into the habit of declaring dependencies per standard library module, you could improve the Debian situation. You would detect the missing module before your application started. Imagine a https://pypi.org/project/shiv/ zip application but with a potential automatic pipx step on first run.
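
For illustration, a minimal sketch of that pre-flight idea, assuming the application lists the stdlib modules it needs (the module names and the Debian hint are only examples):

    # Hypothetical startup check: fail fast if the interpreter is missing
    # stdlib modules the application declares it needs.
    import importlib.util
    import sys

    REQUIRED_STDLIB = ["venv", "sqlite3", "lzma"]   # example list

    missing = [name for name in REQUIRED_STDLIB
               if importlib.util.find_spec(name) is None]
    if missing:
        sys.exit("This Python is missing stdlib modules: "
                 + ", ".join(missing)
                 + " (on Debian they may live in a separate package, "
                   "e.g. python3-venv)")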

I don’t think that implication holds. Saving the developers of
DodgyPython the work of reimplementing the python-only part of the
stdlib does not mean they have to have write access.

And Windel Bouwman likewise objected:

The developers of DodgyPython can use the shared standard library, but
do not require write access to it. They can copy paste the folder into
their own sourcecode, or they could bundle it upon installer/package
creation.

But that’s precisely what they can do now. If they don’t have write
access, it’s not shared access, is it?

The status quo is that any Python implementation can re-use the Python
only parts of the std lib, all they need to do is “copy paste the
folder into their own sourcecode”, just as you say.

If that’s the only problem you want to solve, the Time Machine strikes
again and it’s already solved. (At least for the portion of the stdlib
that is in pure Python.)

But just as you said, copying is a one-off process, and the two copies
will eventually get out of sync. To avoid that:

  • each implementation has to periodically refresh their copy of the
    stdlib from the CPython version;

  • and submit any changes they make to their copy back to CPython;

  • and hope that CPython accepts the changes.

In other words, in the world we live in today, CPython’s version of the
std lib is the “master copy” and has a privileged position as the One
True version of the modules. Anyone can submit PRs to modify the stdlib,
but implementations other than CPython have no privileged status.

Sharing the code implies that those who share it have equal status. It
won’t be just CPython that has the One True version, all implementations
will have equal write access. Otherwise, it’s not shared, it’s just
copied, which is what they can do now.

If PyPy modifies a module, the change will automatically be seen by all
Python implementations, not just PyPy.

(I’m not saying that the changes will appear by magic. Presumably they
will appear in separate branches of separate repos, or something like
that. The details of interoperability between repos will presumably need
working out. The technical details might be hard to solve, or easy to
solve, I don’t know.)

But if PyPy has write permission to the shared parts of the repo, they
effectively have write permission to the shared parts of everyone’s
repos, since that’s what shared means.

And the same applies to DodgyPython unless we act as gatekeepers,
splitting the world of Python interpreters into “trusted” and
“untrusted” implementations.

(By the way, in case it wasn’t obvious, I made up the name “DodgyPython”
to avoid singling out any actual existing implementation as untrusted.)

You’ll never get enough of the ecosystem to declare their dependencies on specific parts of the stdlib. I am sympathetic to your problem, but the solution should not require any changes for existing Python users (as long as they aren’t relying on the physical layout of the stdlib in the filesystem – games with the default sys.path are to some extent acceptable). Nor is a solution feasible that provides backwards compatibility for some time while deprecating current habits (again excluding reliance on filesystem layout).

That said, proposals that categorize the stdlib into multiple tiers could be useful for a variety of alternative Python implementations that are struggling to find the resources to support or verify all of the stdlib. PyPy isn’t the only one that has gone here before; Jython and IronPython are in the same boat, and there are new implementations just around the corner (I’ve heard of something named GraalPython out of Oracle, and there are always people playing with transpiling to JS or WASM).

In terms of your compatibility story you will have to play it similarly to MicroPython and CircuitPython: they claim full compatibility with syntax and builtins of a specific version of Python, but advertise clearly that they have a limited stdlib (not to mention less memory :-).

1 Like

Here’s what openembedded calls “core” Python.

Suppose Python moved all of these into a “core” directory in its source tree. A build step would recreate the current Lib/ directory by copying “core” and everything else into one directory. Bonus points if lib-dynload/ has alternative pure-Python implementations.
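
As a rough sketch of what that build step might look like (the directory names here are purely hypothetical):

    # Rebuild a flat Lib/ directory from a hypothetical Lib/core/ subset
    # plus the remaining modules.
    import shutil
    from pathlib import Path

    SRC = Path("Lib")          # assumed layout: Lib/core/ plus everything else
    OUT = Path("build/Lib")

    OUT.mkdir(parents=True, exist_ok=True)
    # Copy the "core" subset first, then overlay the rest on top of it.
    shutil.copytree(SRC / "core", OUT, dirs_exist_ok=True)
    for entry in SRC.iterdir():
        if entry.name == "core":
            continue
        if entry.is_dir():
            shutil.copytree(entry, OUT / entry.name, dirs_exist_ok=True)
        else:
            shutil.copy2(entry, OUT / entry.name)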

Now an alternative Python’s development process keeps copying core/ and maintaining a patch set on top of it. If two alternative Pythons could share the same system for maintaining patched versions of the standard library, that would be amazing.

This could be very nice and organized and even give CPython developers a better idea of what they are working on. Alternatively, openembedded et al. might have a different idea and prefer to copy everything into the old Lib/ directory before splitting? Anyone have a clearer vision?

openembedded’s core python:

   ${bindir}/python*[!-config]
   ${includedir}/python${PYTHON_BINABI}/pyconfig*.h
   ${prefix}/lib/python${PYTHON_MAJMIN}/config*/*[!.a]
    UserDict.py
    UserList.py
    UserString.py
    __future__.py
    _abcoll.py
    _bootlocale.py
    _collections_abc.py
    _markupbase.py
    _sitebuiltins.py
    _sysconfigdata*.py
    _weakrefset.py
    abc.py
    argparse.py
    ast.py
    bisect.py
    code.py
    codecs.py
    codeop.py
    collections
    collections/abc.py
    configparser.py
    contextlib.py
    copy.py
    copyreg.py
    csv.py
    dis.py
    encodings
    encodings/aliases.py
    encodings/latin_1.py
    encodings/utf_8.py
    enum.py
    functools.py
    genericpath.py
    getopt.py
    gettext.py
    heapq.py
    imp.py
    importlib
    importlib/_bootstrap.py
    importlib/_bootstrap_external.py
    importlib/abc.py
    importlib/machinery.py
    importlib/util.py
    inspect.py
    io.py
    keyword.py
    lib-dynload/__pycache__/_struct.*.so
    lib-dynload/__pycache__/binascii.*.so
    lib-dynload/__pycache__/time.*.so
    lib-dynload/__pycache__/xreadlines.*.so
    lib-dynload/_bisect.*.so
    lib-dynload/_csv.*.so
    lib-dynload/_heapq.*.so
    lib-dynload/_opcode.*.so
    lib-dynload/_posixsubprocess.*.so
    lib-dynload/_struct.*.so
    lib-dynload/array.*.so
    lib-dynload/binascii.*.so
    lib-dynload/math.*.so
    lib-dynload/parser.*.so
    lib-dynload/readline.*.so
    lib-dynload/select.*.so
    lib-dynload/time.*.so
    lib-dynload/unicodedata.*.so
    lib-dynload/xreadlines.*.so
    linecache.py
    locale.py
    new.py
    ntpath.py
    opcode.py
    operator.py
    optparse.py
    os.py
    platform.py
    posixpath.py
    re.py
    reprlib.py
    rlcompleter.py
    selectors.py
    signal.py
    site.py
    sitecustomize.py
    sre_compile.py
    sre_constants.py
    sre_parse.py
    stat.py
    stringprep.py
    struct.py
    subprocess.py
    symbol.py
    sysconfig.py
    textwrap.py
    threading.py
    token.py
    tokenize.py
    traceback.py
    types.py
    warnings.py
    weakref.py

Something like this is the approach I would suggest. I.e. find all Python modules that can potentially run on other Python VMs. Leave those in Lib and move the rest to some other folder.

In the short term, Lib can remain as part of the CPython repo. Or, we could make it a git submodule. In the long term, I suspect it would be good if Lib could become its own repo with its own release cycle. Having it not so intimately tied to CPython would provide value to the Python community.

Yes. This reduces the barrier of writing an alternative Python VM and that’s a good thing. Also, forcing pure-python versions helps ensure that the module is possible to implement in pure Python (e.g. avoid using low-level CPython features or quirks).

This is already required via PEP 399, so it’s only older modules, for which no one has put in the time and effort to write a pure Python port, that still need this. But also realize that there has been some pushback in the past on doing this for older modules, as CPython doesn’t need it and PyPy already has their RPython equivalents (and they aren’t needed by MicroPython).
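
For reference, the testing pattern PEP 399 asks for looks roughly like this (heapq is the example the PEP itself uses; on recent CPython the helper lives in test.support.import_helper):

    # Run the same tests against both the pure Python and the C-accelerated
    # version of a module, as PEP 399 recommends.
    from test.support import import_fresh_module
    import unittest

    py_heapq = import_fresh_module('heapq', blocked=['_heapq'])
    c_heapq = import_fresh_module('heapq', fresh=['_heapq'])

    class TestHeap:
        def test_push_pop(self):
            heap = []
            self.module.heappush(heap, 3)
            self.module.heappush(heap, 1)
            self.assertEqual(self.module.heappop(heap), 1)

    class TestPyHeap(TestHeap, unittest.TestCase):
        module = py_heapq

    class TestCHeap(TestHeap, unittest.TestCase):
        module = c_heapq

    if __name__ == '__main__':
        unittest.main()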

I would hope that if we either broke out the stdlib into its own repo or pushed to make it more portable across Python implementations, people wouldn’t object and would then be up for doing the work.

Also of possible interest, https://github.com/beeware/ouroboros is (edit: claims to be) a pure-Python implementation of the standard library.

2 Likes

Is it? It looks just like a dumb copy of the stdlib from a few years ago. Take a look at e.g. io.py, lzma.py or os.py: they rely quite a bit on the existence of C extension modules not provided in the source tree.

To implement this, I would propose several phases of refactoring.

Phase 1: categorization. Within the CPython repository, split the standard library into multiple folders, each containing a specific functional group of the standard library. Then, during the build process of the installer, these libraries must either be copied back into a single folder, or the sys.path variable must be extended to include the separate category folders.
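
Just to make the second option concrete, a sketch of what the sys.path extension could look like at startup (the root and category names are invented for illustration):

    # Hypothetical: put each stdlib category folder on the import path.
    import sys
    from pathlib import Path

    STDLIB_ROOT = Path(sys.prefix) / "lib" / "stdlib-categories"

    for category in ("core", "net", "data", "dev"):
        folder = STDLIB_ROOT / category
        if folder.is_dir():
            sys.path.append(str(folder))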

Phase 2: Move Python-only libraries into their own repository under github.com/python/stdlib. During the release build of CPython, this repository is bundled with CPython and included either in a single folder or in separate folders, with sys.path set accordingly.

Phase 3: The new stdlib repository can have its own release cycle, separate from CPython, and its separate function groups can be packaged into wheels. This allows Python core distributions to be created and allows the rest of the standard library to be installed as needed. For example, an import error could be raised indicating the proper command to install the extra required libraries.
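
That import error hint could be done with a last-resort meta path finder, something like this sketch (the wheel names are made up):

    # Hypothetical last-resort finder: if an optional stdlib module is not
    # installed, point the user at the wheel that would provide it.
    import sys

    OPTIONAL_STDLIB = {
        "tkinter": "stdlib-tkinter",
        "sqlite3": "stdlib-sqlite3",
    }

    class MissingStdlibFinder:
        def find_spec(self, fullname, path=None, target=None):
            top = fullname.partition(".")[0]
            if top in OPTIONAL_STDLIB:
                raise ModuleNotFoundError(
                    f"{top!r} is not installed; try "
                    f"'pip install {OPTIONAL_STDLIB[top]}'",
                    name=fullname)
            return None

    # Appended last, so it only triggers when no real finder matched.
    sys.meta_path.append(MissingStdlibFinder())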

But why would it? It seems you’re underestimating the synchronization costs that incurs. Now there are two separate but closely dependent repos. CI must ensure that compatible versions of those repos are tested. Packagers and maintainers must ensure that cross-compatibility is correctly documented (and upheld). Developers will sometimes need to submit synchronized PRs for both repos (because sometimes, to change something in the stdlib, you also need to change something in the core types or interpreter runtime, or vice-versa). Users must reason about both the runtime version and the stdlib version. Ancillary resources like the developer’s guide must grow dedicated sections for each project/repository.

1 Like

As my 2 cents, it would be useful for alternative VMs to be able to tag the stdlib version they support and have tested. This would ease upgrades by running a test matrix against new versions.

For CPython, what about always installing/testing on the same release version? E.g. CPython 3.9.1 can imply stdlib 3.9.1

Then DodgyPython 0.2 could start by trying stdlib 3.5.0 and move forward to 3.9.1 only once it supports the syntax that version of the CPython stdlib needs.

(Btw, thanks for PEP 399. It helps a lot)

I realize that the phases I proposed are increasingly controversial. Don’t get me wrong, I’m a monorepo fan now, while I was for split repositories in the past. I understand the extra work that comes with separate repositories (like you mentioned, configuration management and developers struggling with multiple repositories, to name a few). This basically boils down to the question: does the Python standard library belong to the CPython implementation?

I don’t know, but I don’t remember seeing developers of alternate Python implementations contribute substantially to the stdlib.

Is a pure-Python version of a stdlib module that currently exists only in C a desired contribution? I am trying to port the C implementation of _codecs to pure Python, but what would be the benefit to CPython of submitting it?

It makes no sense for alternative implementors to send pure-Python versions that only alternative implementations would benefit from, unless you say it is desired, in light of e.g. PEP 399.

Question: can having a clear separation between CPython and the stdlib increase contributions to the latter?

We actually granted repo access to select people from other VMs in order to help fix compatibility issues, but they never used those abilities. I think they basically had their workflow already worked out and found it easier to just stick with it rather than learn ours and contribute upstream.

From a CPython perspective we don’t need it, but if the alternative VMs could use it then it’s a potential conversation to have (I don’t know if that specific module in pure Python is beneficial or if they would still rather have it natively implemented for performance).

That’s an open question that I think we’re all trying to think through whether the answer is “yes” or not.

It makes a lot of sense. I mean, if CPython chooses to implement in C, Grumpy will do it in Go and RustPython in Rust. However, having slow pure-Python versions available benefits “newcomers” greatly, in my opinion. You get something working sooner, and can optimize later if the project gets traction.

CPython’s syntax moves faster than I can port to Grumpy. Keeping in sync with the stdlib right now looks infeasible.

A new VM’s workflow starts by copying any *.py it can, because it really needs to focus on the Python language syntax only. In this respect, Grumpy gathered from the stdlib, then PyPy, then ouroboros. Having a single repo to include as a submodule would be useful.

Example: Grumpy had no support for unpacking in except: clauses, nor multiple objects in a with statement. When the stdlib is refactored, stuff breaks. This can be mitigated by having a stdlib test structure independent of CPython; a pluggable one would be best.
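
For concreteness, the kind of constructs meant here (as I read it) are things like multiple context managers in one with statement, or catching several exception types bound with `as` in one except clause:

    # Syntax that a young VM may not parse yet, but that stdlib
    # refactorings freely start using (illustrative only).
    with open("a.txt") as a, open("b.txt") as b:      # two context managers
        data = a.read() + b.read()

    try:
        value = int("not a number")
    except (TypeError, ValueError) as exc:            # several types, bound with `as`
        value = 0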

However, I understand that CPython gains no benefit from this besides friendliness to a diverse ecosystem, so its effort should be the minimum needed.

So I guess my question is whether a pure Python version is feasible to be used by an alternative implementation since codecs is such a fundamental module? If the answer is “yes” then I think a more formal discussion on python-dev might be called for to discuss whether the overall dev team thinks it’s worth taking on the responsibility of maintaining a pure Python version. But if the pure Python version is just for illustrative purposes then I don’t know if the maintenance burden is worth it.