Re-use of standard library across implementations

Is it? It looks just like a dumb copy of the stdlib from a few years ago. Take a look at e.g. io.py, lzma.py or os.py: they rely quite a bit on the existence of C extension modules not provided in the source tree.

To implement this, I would propose several phases of refactoring.

Phase 1: categorization. Within the CPython repository, split the standard library into multiple folders, each containing a specific functional group of the standard library. Then, when the installer is built, these libraries must either be copied back into a single folder, or sys.path must be extended to include each of the category folders.

Phase 2: Move pure-Python libraries into their own repository under github.com/python/stdlib. During the release build of CPython, this repository is bundled with CPython and included either in a single folder or in separate folders, with sys.path set accordingly.

Phase 3: The new stdlib repository can have its own release cycle, separate from CPython, and its separate functional groups can be packaged into wheels. This allows core Python distributions to be created, with the rest of the standard library installed as needed. For example, an ImportError could be raised that indicates the proper command to install the extra required libraries.
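To make the Phase 3 idea concrete, here is a minimal sketch of how a core distribution might turn a missing optional module into an actionable error. The module name `stdlib_extras_demo`, the wheel name `stdlib-extras-demo`, the mapping and the helper `install_hint` are all hypothetical; nothing like this exists in CPython today.

```python
import importlib
import importlib.abc
import sys

# Hypothetical mapping: optional stdlib module -> wheel that would ship it.
OPTIONAL_STDLIB_WHEELS = {"stdlib_extras_demo": "stdlib-extras-demo"}


class InstallHintFinder(importlib.abc.MetaPathFinder):
    """Last-resort finder that explains how to install a missing stdlib group."""

    def find_spec(self, fullname, path=None, target=None):
        wheel = OPTIONAL_STDLIB_WHEELS.get(fullname)
        if wheel is None:
            return None  # not ours; let the import fail in the usual way
        raise ModuleNotFoundError(
            f"{fullname!r} is not in the core distribution; "
            f"try: python -m pip install {wheel}",
            name=fullname,
        )


# Appended last, so it only runs after the regular finders found nothing.
sys.meta_path.append(InstallHintFinder())


def install_hint(module_name):
    """Return the error message for a failed import, or None if it succeeds."""
    try:
        importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        return str(exc)
    return None
```

Because the finder sits at the end of sys.meta_path, modules that are actually installed are unaffected; only names listed in the mapping get the extra hint.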

But why would it? It seems you’re underestimating the synchronization costs that this incurs. Now there are two separate but closely dependent repos. CI must ensure that compatible versions of those repos are tested. Packagers and maintainers must ensure that cross-compatibility is correctly documented (and upheld). Developers will sometimes need to submit synchronized PRs for both repos (because sometimes, to change something in the stdlib, you also need to change something in the core types or interpreter runtime, or vice-versa). Users must reason about both the runtime version and the stdlib version. Ancillary resources like the developer’s guide must grow dedicated sections for each project/repository.


My 2 cents: it would be useful for alternative VMs to be able to tag the stdlib version they support/test against. This can ease upgrades by running a test matrix against newer versions.

For CPython, what about always installing/testing on the same release version? E.g. CPython 3.9.1 can imply stdlib 3.9.1

Then DodgyPython 0.2 can start by trying stdlib 3.5.0 and move forward to 3.9.1 only once it has the syntax needed to support that version of the CPython stdlib.
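As a rough illustration of that upgrade path, assuming stdlib releases were tagged with plain version tuples (the release list below is made up for the example, and `upgrade_path` is a hypothetical helper):

```python
# Hypothetical stdlib release tags; under the proposal above,
# CPython 3.9.1 would simply imply stdlib 3.9.1.
STDLIB_RELEASES = [(3, 5, 0), (3, 6, 0), (3, 9, 1)]


def upgrade_path(current, target, releases=STDLIB_RELEASES):
    """Return the stdlib releases to test, in order, when moving from current to target."""
    return [release for release in sorted(releases) if current < release <= target]
```

An alternative VM currently passing on (3, 5, 0) would then run its test matrix against each entry of `upgrade_path((3, 5, 0), (3, 9, 1))` in turn.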

(Btw, thanks for PEP 399. It helps a lot)

I realize that the phases I proposed are increasingly controversial. Don’t get me wrong, I’m a monorepo fan now, while I was for split repositories in the past. I understand the extra work that comes with separate repositories (as you mentioned, configuration management and developers struggling with multiple repositories, to name a few). This basically boils down to the question: does the Python standard library belong to the CPython implementation?

I don’t know, but I don’t remember seeing developers of alternate Python implementations contribute substantially to the stdlib.

Is a pure-Python version of a stdlib module that already exists in C a desired contribution? I am trying to port _codecs from C to Python, but what would be the benefit to CPython of submitting it?

It makes no sense for alternative implementors to send pure-Python versions that only alternative implementations would benefit from, unless you say they are desired, in light of e.g. PEP 399.

Question: Can having a clear separation between CPython and the stdlib increase contributions to the latter?

We actually granted repo access to select people from other VMs in order to help fix compatibility issues but they never used those abilities. I think they basically had their workflow already worked out and found it easier to just stick with that rather than learn ours and contribute upstream.

From a CPython perspective we don’t need it, but if the alternative VMs could use it then it’s a potential conversation to have (I don’t know if that specific module in pure Python is beneficial or if they would still rather have it natively implemented for performance).

That’s an open question; I think we’re all still trying to work out whether the answer is “yes”.

It makes a lot of sense. I mean, if CPython chooses to implement in C, Grumpy will do it in Go and RustPython in Rust. However, having slow Python versions available greatly benefits “newcomers”, in my opinion. You get a working thing earlier. We can optimize later if the project gets traction.

CPython moves syntax faster than I can port it to Grumpy. Keeping in sync with the stdlib right now looks infeasible.

A new VM workflow starts by copying everything that is *.py, because it is important to focus only on the Python language syntax at first. In this respect, Grumpy gathered code from the stdlib, then PyPy, then Ouroboros. Having a single repo to include as a submodule would be useful.

Example: Grumpy had no support for unpacking in except: clauses, nor for multiple objects in with. When the stdlib is refactored, stuff breaks. This can be mitigated by having a stdlib test structure independent of CPython. Pluggable would be best.

However, I understand that CPython gets no benefit from it, besides friendliness to diversity. So their effort should be the minimum needed.

So I guess my question is whether a pure Python version is feasible to be used by an alternative implementation since codecs is such a fundamental module? If the answer is “yes” then I think a more formal discussion on python-dev might be called for to discuss whether the overall dev team thinks it’s worth taking on the responsibility of maintaining a pure Python version. But if the pure Python version is just for illustrative purposes then I don’t know if the maintenance burden is worth it.

IIUC the main reason the private _codecs module is written in C is speed – we want encoding and decoding to be blazingly fast. The public codecs module is written in Python. I think it makes sense for each implementation to provide a minimal native version of _codecs (maybe implementing only UTF-8, Latin-1 and ASCII codecs – you need UTF-8 anyway and the other two are trivial) and leave the rest up to negotiations with the users. This technically leaves some undocumented “public” APIs out of the codecs module (e.g. utf_7_encode), but those seem to be undocumented, and user code ought to always use the encoding lookup mechanism rather than calling the encoding-specific functions directly on the codecs module.
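For illustration, this is roughly what going through the encoding lookup mechanism looks like from user code, as opposed to calling an encoding-specific function on the codecs module. The two helper names are hypothetical; codecs.lookup and the CodecInfo it returns are the documented API.

```python
import codecs


def encode_with_lookup(text, encoding):
    """Encode text through the codec registry rather than a codecs.<name>_encode function."""
    codec_info = codecs.lookup(encoding)  # returns a codecs.CodecInfo
    data, _consumed = codec_info.encode(text)
    return data


def decode_with_lookup(data, encoding):
    """Decode bytes through the codec registry."""
    codec_info = codecs.lookup(encoding)
    text, _consumed = codec_info.decode(data)
    return text
```

Code written this way only depends on the registry, so it keeps working whether a given codec is backed by a native _codecs function or by something else.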

I think so, but I have not implemented it yet. As slow as it may be, it will be a universally useful implementation anyway.

Maybe it is not the dev team’s responsibility to code the first version of it. But it would be nice if an upcoming pure version were kept in sync with the C one, as stated in PEP 399.

When trying to get Grumpy to do something useful, I tried to get python -m zipfile to work. It imports io, which needs _pyio, which needs codecs, which needs _codecs. Then Grumpy fails.

I don’t think it will be just illustrative, as even simple things need it.

The earlier Grumpy gets compatibility and usefulness, the earlier it can focus on speed and choose what to rewrite in “native” Go code.

Yes. But maybe pure versions of stuff like this are still useful.

But if _codecs is not optional for codecs, can the latter be considered pure in a practical sense?

My proposal is for codecs to provide it even without _codecs, or for a _codecs.py to provide at least this. The way it is right now, codecs is not usable in a pure-only environment.

Anyway, I brought up _codecs only as an example. The main point for me here is for the stdlib to ease integration, contributions and testing with non-CPython VMs. We can start by having a way for alternative VMs to target, and maybe git submodule, specific stdlib versions.

The Stdlib/Halflings/Core separation from @windelbouwman seems like a nice evolution of this idea.

I am not against the idea of providing pure-Python module implementations, but feel codecs is not a particularly good example. Many modern programming languages (Go definitely included) have robust encoding support, and it is probably easier for each implementation to interface directly to the underlying standard libraries. The better approach for codecs specifically IMO would be to spec out the _codecs interface so alternative implementations can provide it for codecs. That alternative _codecs implementation might as well be written in Python, but I don’t think it’s CPython’s responsibility to maintain that.


What about implementations based on JavaScript, or even MicroPython ones? Writing native code is often a pain and not practical right now.

It is just an example. I agree that CPython should not bear the burden of creating the pure version of existing C stuff. However, if someone contributes a pure version and a minimal suite of tests, I would expect CPython to keep it passing for both the C and pure versions, per PEP 399:

Re-implementing parts (or all) of a module in C (in the case of CPython) is still allowed for performance reasons, but any such accelerated code must pass the same test suite (sans VM- or C-specific tests) to verify semantics and prevent divergence.
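A rough sketch of what that requirement looks like in practice, using heapq (which has a C accelerator, _heapq, in CPython) instead of _codecs. The helper `import_without` and the check function are hypothetical and deliberately tiny; CPython's own test suite has its own helpers for importing a module with its accelerator blocked.

```python
import importlib
import sys


def import_without(name, blocked):
    """Import `name` afresh while the modules listed in `blocked` are unimportable."""
    # Remember and remove whatever is currently cached for these modules.
    saved = {mod: sys.modules.pop(mod, None) for mod in (name, *blocked)}
    try:
        for mod in blocked:
            sys.modules[mod] = None  # makes "import mod" raise ImportError
        return importlib.import_module(name)
    finally:
        # Restore the previous import state so other code is unaffected.
        for mod, module in saved.items():
            if module is None:
                sys.modules.pop(mod, None)
            else:
                sys.modules[mod] = module


def check_heap_semantics(heapq_module):
    """One shared check that every implementation of the module must pass."""
    heap = []
    for value in [5, 1, 4, 2, 3]:
        heapq_module.heappush(heap, value)
    return [heapq_module.heappop(heap) for _ in range(5)]


import heapq as default_heapq  # accelerated on CPython when _heapq is available

pure_heapq = import_without("heapq", ["_heapq"])
```

The same `check_heap_semantics` is then run against both `default_heapq` and `pure_heapq`, which is the "same test suite" idea from the quoted text in miniature.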

Discussing whether _codecs is eligible for the “special permission” is another story, but we can pick some other lib if this eases the real discussion.

I am getting the impression that I am monopolizing the conversation here, so I will step back. Before I do, let me please recall what we are trying to improve, if an improvement is needed:

That’s actually fine. :slight_smile: Everyone is just trying to understand perspectives and needs here appropriately. You are discussing things and being nice and respectful while doing it so it’s totally fine if it’s primarily you and other folks talking; you have a perspective that the majority of people here don’t have.

Probably not. Basically I think we need to decide if we want to blanket accept pure Python versions of any/all modules lacking one if another VM has a use for it, or if we will be selective.

My key _codecs question

Is a pure Python version of _codecs’ functionality something Grumpy – or any other alternative VM that is represented in this discussion – would actually use, or would it be an academic exercise? (Guessing about e.g. “implementations based on JavaScript” doesn’t count.)

My other question

Does someone have a list of modules which are missing a pure Python version which could reasonably be expected to have one (e.g. sqlite doesn’t count :wink:; assume ctypes isn’t available)?

In my opinion this would be used, or at least started with, in any new VM. If all C modules had a Python version, Grumpy / RustPython / Batavia (JS) / VOC / Transcrypt (JS) / MicroPython@Unix / etc. would focus first on OS integration like sockets, select or the filesystem, to achieve a level of compatibility. Then go for optimization. This would benefit all of them, as “Premature optimization is the root of all evil”. And let’s assume that when you have a new VM, any pure module is a module that need not be rewritten until really needed.

But this is my opinion. I would like to hear from @corona10, @cclauss, @auvipy, @aisk, @pmp-p, @freakboy3742 and others, if they want.

In Grumpy, I would like to have a working Flask demo, as everyone asks for every time, no matter how slow, rather than discovering C blockers.

I do not, but I can hack something together via stdlib-list if desired, I guess. RustPython has not_impl_gen.py and whats_left.sh, which could be adapted.

Is this desirable?
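As a starting point for such a list, here is a minimal sketch that flags which top-level modules are provided natively (built-in or extension module) by the running interpreter. It is only a first approximation: it does not know whether a pure Python counterpart already exists elsewhere in the stdlib, and the helper name `native_only_modules` is hypothetical. It assumes `sys.stdlib_module_names`, which exists on CPython 3.10+, and falls back to the built-in names otherwise.

```python
import importlib.machinery
import importlib.util
import sys


def native_only_modules(names):
    """Return the names whose import spec points at a built-in or extension module."""
    native = []
    for name in sorted(names):
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            continue  # cannot be resolved on this interpreter; skip it
        if spec is None:
            continue  # not available on this platform
        is_extension = isinstance(spec.loader, importlib.machinery.ExtensionFileLoader)
        if spec.origin == "built-in" or is_extension:
            native.append(name)
    return native


# Assumption: sys.stdlib_module_names is only available on CPython 3.10+.
candidates = getattr(sys, "stdlib_module_names", sys.builtin_module_names)

if __name__ == "__main__":
    for module_name in native_only_modules(candidates):
        print(module_name)
```

The output would still need manual filtering to drop the modules that cannot reasonably have a pure Python version.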

It certainly isn’t a Pure Python standard library in its current state. I’ll be the first to admit that’s an aspirational goal for the project. In its current state, you are correct that it is almost entirely a copy of old (Python 3.4 IIRC) CPython sources.

I set up the project for exactly the same reason that this thread initially described - to manage a standard library for non-CPython Python implementations. I had two of them (Batavia and VOC), so breaking the common requirement into a project that they could share made sense. I was also hoping to be able to define what the minimal “native implementation” interface would be.

So - I wouldn’t point at Ouroboros as a solution - it’s more a confirmation from another party that the problem described in this thread exists.

Since I was tagged on this; FWIW, this separation (and approach for new Python implementations) makes sense to me. Having the ability to test the modules (and, for that matter, the core language spec) against a compliance test suite would also be valuable - both in terms of validating that a new Python implementation is behaving as expected, and when a native optimized version is available, that the native version is compliant.