Re-use of standard library across implementations

windelbouwman · July 28, 2019, 7:35pm

Hi all,

As a RustPython developer, I would like to be able to share parts of the Python standard library in an easy way. We are now in the process of copying libraries from the CPython source code base into our source repository, but this has a clear downside in that the libraries will get out of sync eventually.

I drew a schematic of a possible solution to this. This involves splitting the standard library, as it is now, into several subfolders which contain the different layers of the standard library.

The idea is to split the Lib folder into several separate folders. This way, the standard library folder can be shared easily. The halfling modules are still manual work, since they are implemented in half Python, half X, where X is the implementation language of choice.

Furthermore, we might want to separate the standard library into so-called good, bad and ugly folders. This idea is from GStreamer, where plugins are grouped in this manner. We could move the old and deprecated modules into the ugly folder.

I’m curious about your opinions on this idea! Feedback is more than welcome.

dholth · July 28, 2019, 8:39pm

I wrote a thing to do wheels for each stdlib module. Perhaps added metadata with per-implementation conditional dependencies could be used by a special installer to build a complete stdlib with alternative implementations for the modules that need it. E.g. adopt the “rp” implementation tag for wheels that only apply in RustPython. http://github.com/dholth/nonstdlib

njs · July 28, 2019, 9:12pm

You’ll want to talk to the PyPy devs. They’ve been dealing with this for a long time so might have suggestions. And any proposal to restructure CPython to make alternative implementations easier will be much more compelling if multiple alternative implementations all agree that it would work for them.

I believe that right now they use a merge-based workflow: they have a special branch containing pristine snapshots of the CPython stdlib, and from time to time they import a new snapshot and then merge it forward into their main development branch. This means they can carry local changes to the stdlib.

It would also be interesting to look at those local changes and see how well they would work in your proposed scheme, e.g. are all the patches in the “core” layer?

steven.daprano · July 29, 2019, 12:33am

windelbouwman:

Hi all,

As a RustPython developer, I would like to be able to share parts of
the Python standard library in an easy way. We are now in the process
of copying libraries from the CPython source code base into our source
repository, but this has a clear downside in that the libraries will
get out of sync eventually.

Sharing the standard library across implementations implies that
anyone who has write permissions to any implementation has write
permissions for all implementations.

And that implies that everyone has write access to the shared part
of the std lib, since anyone can say “I’m developing the DodgyPython
implementation, I need shared write access.”

The assumption I’m making here is that you’re not proposing that we
become gatekeepers for who is, and isn’t, granted write access to other
implementations. I don’t think that it is either desirable or
practical to ask us to act as gatekeepers for write access across the
whole ecosystem of existing and future implementations.

[…]

The idea is to split the Lib folder into several separate folders.
This way, the standard library folder can be shared easily.

I’m sorry, it isn’t clear to me how splitting the Lib folder into
separate folders makes it easier to share than a single folder. The
hard part is to share the first folder: once you’ve solved the
practical problems of sharing one folder, adding extra folders
requires very little additional work.

Treating hybrid modules (Python/C, Python/Java, etc.) separately makes
obvious sense. E.g. since CPython’s re implementation is in C, Java/Rust
etc. implementations probably won’t use it.

But your top level split that divides the pure-Python std lib into
“Good”, “Bad” and “Ugly” seems pointless and unnecessarily insulting to
the authors of the “Bad” and “Ugly” modules.

h-vetinari · July 29, 2019, 11:30am

I don’t think that implication holds. Saving the developers of DodgyPython the work of reimplementing the python-only part of the stdlib does not mean they have to have write access.

pitrou · July 29, 2019, 2:56pm

Hi @windelbouwman,

So, for full disclosure, this topic has already been discussed in the past. The main issue holding it back has been the lack of a driving force to present a full-fledged proposal (it would most certainly be a PEP, perhaps even multiple ones?) and then drive it through adoption, and of course also implement it.

By the time PyPy needed something like this, it probably became easier for them to have their own patching logic (I don’t know if they have dedicated scripts, though I suspect they do?) rather than try to go through the PEP adoption process, with all the discussions such a complex proposal would entail.

As @njs said, you probably want to contact the PyPy devs and get their suggestions. They probably have ample experience on the topic.

windelbouwman · July 29, 2019, 6:54pm

@steven.daprano thank you for your comments!

This implication is not true. The developers of DodgyPython can use the shared standard library, but do not require write access to it. They can copy paste the folder into their own sourcecode, or they could bundle it upon installer/package creation.

You’re fully right about this. This will insult other people, so it is not a good categorization. I still think that some form of categorization of the standard library into subfolders makes sense. As listed now in the documentation, modules are already grouped by use case, for example internet protocol support modules and multimedia modules. My reason for grouping them into folders is to make them easy to navigate. Also, implementations might choose to include or exclude certain groups. This makes the selection of the standard library more granular and easily customizable.

Thanks @njs, I will do this.

windelbouwman · July 29, 2019, 7:04pm

Is this previous attempt located somewhere? How should a PEP work? Could I author this PEP?

pitrou · July 29, 2019, 7:17pm

I think there have been multiple attempts actually (only discussions though). The only one I can find at the moment is the following, perhaps other people can find the other ones: https://mail.python.org/pipermail/python-dev/2016-July/145500.html

windelbouwman · July 29, 2019, 7:48pm

Another example can be taken from the openembedded python package. They take the Python source code and split the library into logical groups, which can be installed separately. If the standard library were organized this way, this packaging could be eased.

https://git.openembedded.org/openembedded-core/tree/meta/recipes-devtools/python/python3/python3-manifest.json
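
To illustrate what selecting groups from such a manifest could look like, here is a minimal sketch. The manifest layout, the group names and the `select_files` function are all hypothetical; they are not taken from the openembedded file, which has its own schema.

```python
from fnmatch import fnmatch

# Hypothetical manifest: group name -> glob patterns relative to Lib/.
# This is an illustration only, not the openembedded schema.
MANIFEST = {
    "core": ["os.py", "abc.py", "collections/*.py"],
    "netclient": ["urllib/*.py", "ftplib.py"],
    "audio": ["wave.py", "sndhdr.py"],
}


def select_files(all_files, groups, manifest=MANIFEST):
    """Return the sorted files from all_files that belong to any requested group."""
    patterns = [pattern for group in groups for pattern in manifest[group]]
    return sorted(f for f in all_files if any(fnmatch(f, p) for p in patterns))


files = ["os.py", "abc.py", "collections/abc.py", "ftplib.py", "wave.py"]
print(select_files(files, ["core", "netclient"]))
```

An implementation that wants to leave out a group would simply not list it in `groups`.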

dholth · July 29, 2019, 7:52pm

That openembedded manifest is wonderful.

brettcannon · July 29, 2019, 8:25pm

Perhaps, but honestly how many people actually ever navigate a CPython source checkout? And even if you do, what is the motivation to browse by file instead of by documentation? And doing it by physical directory makes any changes later more difficult compared to documentation changes. Basically you’re working against 29 years of development habits to get this changed.

But you’re assuming that’s something we want to promote. We already have enough issues with Linux distros like Debian leaving out stuff like venv so I’m not sure if we would want to promote having people claim they support Python while missing significant parts of the stdlib (MicroPython/CircuitPython get away with this since their execution environment is so different from CPython’s). I mean maybe we would be okay with promoting it, but we have not had that discussion.

Basically you would propose a PEP outlining how you would want things to change. E.g. do you want a different folder structure? Do you want to break the stdlib out to its own repository?

Yep! Make sure to read PEP 1 and PEP 12.

njs · July 29, 2019, 8:47pm

This thread is also highly relevant:

dholth · July 29, 2019, 9:53pm

IMO if you do split the standard library into (n) wheels, and get Python applications into the habit of declaring dependencies per standard library module, you could improve the Debian situation. You would detect the missing module before your application started. Imagine a https://pypi.org/project/shiv/ zip application but with a potential automatic pipx step on first run.
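
As a rough sketch of the "detect the missing module before your application started" part: an application could check a declared list of standard library modules up front. The `REQUIRED_STDLIB` list and the `missing_modules` helper are hypothetical names used for illustration; only `importlib.util.find_spec` is an existing standard library call.

```python
import importlib.util


def missing_modules(required):
    """Return the names in `required` that cannot be found on this interpreter."""
    missing = []
    for name in required:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for a missing parent package or a module without a spec.
            found = False
        if not found:
            missing.append(name)
    return missing


# Hypothetical per-application declaration of the stdlib modules it relies on.
REQUIRED_STDLIB = ["json", "venv", "ensurepip"]

print("missing:", missing_modules(REQUIRED_STDLIB))
```

A launcher could run this check first and either stop with a clear message or trigger an install step for whatever is reported missing.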

steven.daprano · July 30, 2019, 1:00am

steven.daprano:

And that implies that everyone has write access to the shared part
of the std lib, since anyone can say “I’m developing the DodgyPython
implementation, I need shared write access.”

h-vetinari replied:

I don’t think that implication holds. Saving the developers of
DodgyPython the work of reimplementing the python-only part of the
stdlib does not mean they have to have write access.

And Windel Bouwman likewise objected:

The developers of DodgyPython can use the shared standard library, but
do not require write access to it. They can copy paste the folder into
their own sourcecode, or they could bundle it upon installer/package
creation.

But that’s precisely what they can do now. If they don’t have write
access, it’s not shared access, is it?

The status quo is that any Python implementation can re-use the Python
only parts of the std lib, all they need to do is “copy paste the
folder into their own sourcecode”, just as you say.

If that’s the only problem you want to solve, the Time Machine strikes
again and it’s already solved. (At least for the portion of the stdlib
that is in pure Python.)

But just as you said, copying is a one-off process, and the two copies
will eventually get out of sync. To avoid that:

each implementation has to periodically refresh their copy of the
stdlib from the CPython version;
and submit any changes they make to their copy back to CPython;
and hope that CPython accepts the changes.
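
The "periodically refresh" step needs some way to see which files have drifted. A minimal sketch, assuming both trees are available on disk; the `tree_digest` and `drift` helpers are hypothetical names, and the two temporary directories only stand in for CPython's Lib/ and a vendored copy.

```python
import hashlib
import os
import tempfile


def tree_digest(root):
    """Map each relative file path under root to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            with open(path, "rb") as fh:
                digests[rel] = hashlib.sha256(fh.read()).hexdigest()
    return digests


def drift(upstream, vendored):
    """Return sorted paths that differ between the trees or exist on one side only."""
    a, b = tree_digest(upstream), tree_digest(vendored)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))


# Demo on two throwaway trees standing in for upstream and a vendored copy.
with tempfile.TemporaryDirectory() as up, tempfile.TemporaryDirectory() as ven:
    for root, body in ((up, "x = 1\n"), (ven, "x = 2  # local patch\n")):
        with open(os.path.join(root, "mod.py"), "w") as fh:
            fh.write(body)
    with open(os.path.join(up, "new.py"), "w") as fh:
        fh.write("")
    print(drift(up, ven))
```

This only reports that files differ. It cannot tell a local patch from an upstream change, which is where a merge-based workflow or a patch set comes in.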

In other words, in the world we live in today, CPython’s version of the
std lib is the “master copy” and has a privileged position as the One
True version of the modules. Anyone can submit PRs to modify the stdlib,
but implementations other than CPython have no privileged status.

Sharing the code implies that those who share it have equal status. It
won’t be just CPython that has the One True version, all implementations
will have equal write access. Otherwise, it’s not shared, it’s just
copied, which is what they can do now.

If PyPy modify a module, it will automatically be seen by all Python
implementations, not just PyPy.

(I’m not saying that the changes will appear by magic. Presumably they
will appear in separate branches of separate repos, or something like
that. The details of interoperability between repos will presumably need
working out. The technical details might be hard to solve, or easy to
solve, I don’t know.)

But if PyPy has write permission to the shared parts of the repo, they
effectively have write permission to the shared parts of everyone’s
repos, since that’s what shared means.

And the same applies to DodgyPython unless we act as gatekeepers,
splitting the world of Python interpreters into “trusted” and
“untrusted” implementations.

(By the way, in case it wasn’t obvious, I made up the name “DodgyPython”
to avoid singling out any actual existing implementation as untrusted.)

guido · August 1, 2019, 5:24pm

You’ll never get enough of the ecosystem to declare their dependencies on specific parts of the stdlib. I am sympathetic to your problem, but the solution should not require any changes for existing Python users (as long as they aren’t relying on the physical layout of the stdlib in the filesystem; games with the default sys.path are to some extent acceptable). Nor is a solution feasible that provides backwards compatibility for some time while deprecating current habits (again excluding reliance on filesystem layout).
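
To make the sys.path remark concrete, here is a small sketch in which a library split across two directories stays invisible to user code. The layer names `shared` and `impl_specific` and the demo module names are hypothetical; this is not how any implementation currently lays out its stdlib.

```python
import importlib
import os
import sys
import tempfile

# Two hypothetical layer directories standing in for a split Lib/ folder.
with tempfile.TemporaryDirectory() as root:
    layers = {"shared": "demo_shared_mod", "impl_specific": "demo_impl_mod"}
    for layer, module in layers.items():
        os.makedirs(os.path.join(root, layer))
        with open(os.path.join(root, layer, module + ".py"), "w") as fh:
            fh.write(f"LAYER = {layer!r}\n")
        # Each layer becomes one more entry on the import path.
        sys.path.append(os.path.join(root, layer))
    importlib.invalidate_caches()

    # User code imports by module name only, unaware of the layers.
    import demo_shared_mod
    import demo_impl_mod
    print(demo_shared_mod.LAYER, demo_impl_mod.LAYER)

    # Remove the path entries added for the demo.
    for layer in layers:
        sys.path.remove(os.path.join(root, layer))
```

Existing `import` statements keep working unchanged; only code that hard-codes the location of files inside Lib/ would notice the split.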

That said, proposals that categorize the stdlib into multiple tiers could be useful for a variety of alternative Python implementations that are struggling to find the resources to support or verify all of the stdlib. PyPy is not the only one to have gone here before; Jython and IronPython are also in this boat, and there are new implementations just around the corner (I’ve heard of something named GraalPython out of Oracle, and there are always people playing with transpiling to JS or WASM).

In terms of your compatibility story you will have to play it similarly to MicroPython and CircuitPython: they claim full compatibility with syntax and builtins of a specific version of Python, but advertise clearly that they have a limited stdlib (not to mention less memory :-).

dholth · August 1, 2019, 8:16pm

Here’s what openembedded calls “core” Python.

Suppose Python moves all of these into a “core” directory in its source tree. A build step would recreate the current Lib/ directory by copying “core” and everything else into one directory. Bonus points if lib-dynload/ has alternative pure-python implementations.
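
Such a build step could be quite small. A minimal sketch, assuming the layers are plain directories; the `assemble_lib` helper and the `extra` directory name are hypothetical, and `dirs_exist_ok` requires Python 3.8 or later.

```python
import os
import shutil
import tempfile


def assemble_lib(layer_dirs, dest):
    """Copy each layer into dest in order; later layers overwrite earlier ones."""
    os.makedirs(dest, exist_ok=True)
    for layer in layer_dirs:
        shutil.copytree(layer, dest, dirs_exist_ok=True)  # Python 3.8+
    return sorted(os.listdir(dest))


# Demo with throwaway directories standing in for core/ and the remaining modules.
with tempfile.TemporaryDirectory() as src:
    core = os.path.join(src, "core")
    extra = os.path.join(src, "extra")
    for folder, name in ((core, "os.py"), (extra, "wave.py")):
        os.makedirs(folder)
        with open(os.path.join(folder, name), "w") as fh:
            fh.write("")
    print(assemble_lib([core, extra], os.path.join(src, "Lib")))
```

Because later layers overwrite earlier ones, an alternative implementation could put its patched files in a final layer instead of editing core/ in place.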

Now an alternative Python’s development process keeps copying core/ and maintains a patch set on top of it. If two alternative Pythons could share the same system to maintain patched versions of the standard library, that would be amazing.

This could be very nice and organized and even give CPython developers a better idea of what they were working on. Alternatively, openembedded et al. might have a different idea and prefer to copy everything into the old Lib/ directory before splitting? Anyone have a clearer vision?

openembedded’s core python:

   ${bindir}/python*[!-config]
   ${includedir}/python${PYTHON_BINABI}/pyconfig*.h
   ${prefix}/lib/python${PYTHON_MAJMIN}/config*/*[!.a]
    UserDict.py
    UserList.py
    UserString.py
    __future__.py
    _abcoll.py
    _bootlocale.py
    _collections_abc.py
    _markupbase.py
    _sitebuiltins.py
    _sysconfigdata*.py
    _weakrefset.py
    abc.py
    argparse.py
    ast.py
    bisect.py
    code.py
    codecs.py
    codeop.py
    collections
    collections/abc.py
    configparser.py
    contextlib.py
    copy.py
    copyreg.py
    csv.py
    dis.py
    encodings
    encodings/aliases.py
    encodings/latin_1.py
    encodings/utf_8.py
    enum.py
    functools.py
    genericpath.py
    getopt.py
    gettext.py
    heapq.py
    imp.py
    importlib
    importlib/_bootstrap.py
    importlib/_bootstrap_external.py
    importlib/abc.py
    importlib/machinery.py
    importlib/util.py
    inspect.py
    io.py
    keyword.py
    lib-dynload/__pycache__/_struct.*.so
    lib-dynload/__pycache__/binascii.*.so
    lib-dynload/__pycache__/time.*.so
    lib-dynload/__pycache__/xreadlines.*.so
    lib-dynload/_bisect.*.so
    lib-dynload/_csv.*.so
    lib-dynload/_heapq.*.so
    lib-dynload/_opcode.*.so
    lib-dynload/_posixsubprocess.*.so
    lib-dynload/_struct.*.so
    lib-dynload/array.*.so
    lib-dynload/binascii.*.so
    lib-dynload/math.*.so
    lib-dynload/parser.*.so
    lib-dynload/readline.*.so
    lib-dynload/select.*.so
    lib-dynload/time.*.so
    lib-dynload/unicodedata.*.so
    lib-dynload/xreadlines.*.so
    linecache.py
    locale.py
    new.py
    ntpath.py
    opcode.py
    operator.py
    optparse.py
    os.py
    platform.py
    posixpath.py
    re.py
    reprlib.py
    rlcompleter.py
    selectors.py
    signal.py
    site.py
    sitecustomize.py
    sre_compile.py
    sre_constants.py
    sre_parse.py
    stat.py
    stringprep.py
    struct.py
    subprocess.py
    symbol.py
    sysconfig.py
    textwrap.py
    threading.py
    token.py
    tokenize.py
    traceback.py
    types.py
    warnings.py
    weakref.py

nas · August 1, 2019, 9:29pm

Something like this is the approach I would suggest. I.e. find all Python modules that can potentially run on other Python VMs. Leave those in Lib and move the rest to some other folder.

In the short term, Lib can remain as part of the CPython repo. Or, we could make it a git submodule. In the long term, I suspect it would be good if Lib could become its own repo with its own release cycle. Having it not so intimately tied to CPython would provide value to the Python community.

Yes. This reduces the barrier of writing an alternative Python VM and that’s a good thing. Also, forcing pure-python versions helps ensure that the module is possible to implement in pure Python (e.g. avoid using low-level CPython features or quirks).

brettcannon · August 1, 2019, 9:42pm

This is already required via PEP 399, so it only remains to be done for older modules for which no one has put in the time and effort to write a pure Python port. But also realize that there has been some pushback in the past against doing this for older modules, as CPython doesn’t need it and PyPy already has their RPython equivalents (and they aren’t needed by MicroPython).
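
For readers unfamiliar with the pattern PEP 399 asks for, here is a minimal sketch modelled on how the stdlib bisect module is laid out: a pure Python definition first, then an optional import of the accelerated version. The function body is a simplified illustration, not a copy of the stdlib source.

```python
def bisect_left(a, x):
    """Pure Python fallback: leftmost index at which x can be inserted into sorted a."""
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    return lo


# Replace the fallback with the accelerated version when the implementation ships one.
try:
    from _bisect import bisect_left
except ImportError:
    pass

print(bisect_left([1, 3, 5, 7], 5))
```

An implementation without a `_bisect` module simply keeps the pure Python definition, so the same file works everywhere.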

I would hope that if we either broke out the stdlib into its own repo or pushed to make it more portable across Python implementations that people wouldn’t object and people would then be up for doing the work.

energizer · August 3, 2019, 8:39pm

Also of possible interest, https://github.com/beeware/ouroboros is (edit: claims to be) a pure-Python implementation of the standard library.