Maintaining the chunk module after it has been removed from the standard library

I had half a mind to do something like this, this past summer, but it’s really overwhelming. I made barely any progress on my idea before giving up.

Hello.

It seems I sparked a larger conversation than I meant to. Sorry for making trouble. I have a suggestion that might please everyone. Perhaps the 20 dead batteries can become part of another package. It would still be distributed with Python, and thus offline users would still have access, but it would be off the core dev plate. This is a Python session I envision.

from community import chunk
*** Warning, this package is not maintained, test thoroughly before deploying. ***
chunk.open(…)

Once there is a viable replacement on PyPI, the original files can be deleted from the main community package, and the PyPI package can be listed as one of its dependencies.

This way, the community can take on responsibility for maintaining one or more sub packages. Including the community package in the build would then be simple.

Also, while I’m here, I’ve been working on the module that started this thing off, chunk. It’s been a busy week, but I’m expecting to be done this weekend. Again, I’ll drop an update at that time.

Thanks again!

3 Likes

No need to apologise, this often happens and it’s sometimes beneficial to spark larger discussions.

A community package can certainly happen, all it requires is people to actually do it, and importantly, maintain it.

Good to hear, thank you!

2 Likes

Hello!

It took me forever, sorry about that, but I’m happy to announce, or at least whisper, the package I promised is up on PyPI. Links are below.

I know we’re now way off topic, so this will be my last post about this project on this thread, but feedback is welcome. Please be kind, most of my projects are directly distributed, so uploading to PyPI is new to me.

Thank you for your input.

PyPI: chunkmuncher · PyPI

Github: https://github.com/jscuster/chunkmuncher

8 Likes

Good work, and congratulations on your first PyPI package! :rocket: :balloon:

Feel free to open new threads if you have any questions.

4 Likes

Thank you! :raised_hands: I’ll do just that. :mailbox_with_mail:

1 Like

FWIW, I moved this to Core Development with the stdlib tag (which BTW any TL3+ (“Regular”) user like you can do, not just mods).

It’s worth noting that similar ideas were discussed on and off throughout the PEP’s lifecycle, but not adopted, and something like this is actually mentioned as a rejected idea in the PEP. Offline users can still have access via either vendoring/copying the module code, or by unzipping/installing the wheel (downloaded from PyPI or built themselves).

This was also my first thought back when the PEP was still being discussed, since AIFF is hardly any more obsolete than WAV, and the IFF format implemented by chunk underlies both (and nearly the complete chunk module implementation is just vendored into the wave module anyway).

However, the actual real-world usage numbers (which I originally compiled and was going to post back when the OP’s comment was first made, but wasn’t able to before I had to go on a trip) tell a somewhat different story:

When searching import chunk across GitHub via Grep.app and manually checking all 119 hits, only 8 of them were for the actual chunk stdlib module (as opposed to another module with the same, common, name) and not already vendored, and of those only 3 imports were actually used. The breakdown:

  • scripts: 1 (test file generator script for audio library)
  • library: 2 (Z machine interpreter, music programming language)
  • unused: 5 (2 script, 1 deprecated, 1 example app, 1 library)

For import aifc, there were 14 unique hits across GitHub (out of 41 total, including vendored copies of the stdlib and duplicate hits in the same project), at least some of which may already be incompatible with Python 3.13+, unused or otherwise unaffected (as I didn’t investigate any in detail).

By contrast, there were 1179 hits for import wave, which after spot-checking were nearly all genuine imports of the stdlib module.

When searching from chunk import Chunk (the only top-level name in the module), there were only 36 total hits, of which only 3 were current script/library usages (and thus possibly affected), 3 were explicitly for old Python versions and long-unmaintained, 2 were standalone, modified vendored copies of chunk and the rest were just part of a vendored stdlib. The breakdown:

  • script: 1 (Extract WAV files from SF2)
  • library: 2 (Amiga emulator, cassette reader for PCBasic emulator)
  • unmaintained: 2 (6 years - subtitle shifter, 3 years / Py 3.9 - soundbank reader for cozmo robot)
  • Python 2: 1 (1 script, Glulx VM profiler)
  • vendored/modified: 2 (Improved wave module for xbox tools, wave module extracted into sampler tool)

As for from aifc import, there was only a single hit, while for from wave import there were 15.

In total, the usage ratio of wave to aifc is around 75:1, while for chunk it is nearly 200:1. Given such a relatively small amount of actual usage of either module (around a half dozen to a dozen usages across all of GitHub), maintaining and distributing it to all users via the stdlib seems somewhat hard to justify, at least on an empirical basis. (Of course, this doesn’t capture all usage, but the relative numbers should be at least roughly reliable as a useful proxy, absent a massive skew toward using aifc or chunk over wave in non-public repos relative to public ones.)
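For anyone wanting to check the arithmetic, ratios of this order can be reproduced from the counts quoted above. The sketch below is only one plausible reading: it assumes the two search forms are simply summed per module, and that the chunk figure counts the 3 used import chunk hits plus the 3 current from chunk import Chunk usages. The exact combination behind the quoted figures is not stated, so under these assumptions the aifc ratio comes out nearer 80:1 than 75:1.

```python
# Hit counts quoted in the post above (GitHub search via Grep.app).
hits = {
    "wave": {"import": 1179, "from_import": 15},
    "aifc": {"import": 14, "from_import": 1},
    # Assumption: only hits described as actually used/current are counted.
    "chunk": {"import": 3, "from_import": 3},
}


def total(module):
    """Sum both search forms for one module."""
    return sum(hits[module].values())


def ratio_to_wave(module):
    """How many wave hits there are per hit of the given module."""
    return total("wave") / total(module)


if __name__ == "__main__":
    for module in ("aifc", "chunk"):
        print(f"wave:{module} is about {ratio_to_wave(module):.0f}:1")
```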

3 Likes

Thank you for doing the search on Github.

However, please note that there are plenty of non-open-source uses of such modules as well, e.g. in scripts run locally by people in the movie or audio industry. See the note in the PEP regarding the aifc module.

In larger companies, it is often very difficult to get anything installed from PyPI. You typically have to request permission from IT, they have to screen the code, run it through security checks, possibly add it to a locally maintained package repo, etc. Information security is a big topic at big companies :slight_smile: (and rightly so).

This is why “batteries included” is such a big win for Python. The level of trust put into the Python stdlib is far higher than what’s typically put into some random package on PyPI (and again: rightly so).

The stdlib is not a good place for code which needs to change often or has many dependencies, but it’s perfect for code which is easy to maintain and hardly ever changes, such as the audio modules.

6 Likes

IMO, the chunk module could work as a tutorial in the docs, explaining how to implement file-like objects and parse binary data using struct.
Most of the additional complexity comes from config options needed for “not-quite-IFF” chunk formats (like the “bigendian” option that allows it to read RIFF), which muddle the scope of the library (e.g. why not add “checksum” and “swap ID and size” to support PNG?), but would be easy to add to tutorial code for any particular use case.

The question is whether the tutorial would be practical. In many cases all you need is 20 lines of a readchunk function that returns a dataclass with name and data as bytes… Well, maybe that would be a good first step for the tutorial.
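As a rough illustration of what such a first step might look like, here is a minimal sketch of a readchunk function. It is not taken from the chunk module: the RawChunk name, the bigendian keyword, and the error handling are assumptions made for the example. It assumes the plain IFF layout of a 4-byte ID followed by a 4-byte size that excludes the header, with a pad byte after odd-sized data.

```python
import io
import struct
from dataclasses import dataclass
from typing import BinaryIO, Optional


@dataclass
class RawChunk:
    """Hypothetical container: the 4-byte chunk ID and its payload."""
    name: bytes
    data: bytes


def readchunk(stream: BinaryIO, bigendian: bool = True) -> Optional[RawChunk]:
    """Read one IFF-style chunk from stream, or return None at end of input."""
    header = stream.read(8)
    if not header:
        return None
    if len(header) < 8:
        raise EOFError("truncated chunk header")
    # 4-byte ID followed by a 4-byte unsigned size (big-endian for IFF/AIFF,
    # little-endian for RIFF/WAV).
    name, size = struct.unpack(">4sL" if bigendian else "<4sL", header)
    data = stream.read(size)
    if len(data) < size:
        raise EOFError("truncated chunk data")
    if size % 2:
        stream.read(1)  # skip the pad byte that keeps chunks 2-byte aligned
    return RawChunk(name, data)


if __name__ == "__main__":
    sample = io.BytesIO(b"NAME" + struct.pack(">L", 3) + b"abc" + b"\x00")
    print(readchunk(sample))
```

A tutorial could then layer format-specific variations, such as the little-endian RIFF case handled by the flag here, on top of this core.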

2 Likes

If these large companies are so dependent on what’s in the stdlib, maybe they need to pay some dedicated core devs to maintain the modules they use? :wink:

2 Likes

You mean like the funding for three developer in residence roles?

I agree that we shouldn’t direct a bunch of volunteer effort towards things that are only of benefit to corporate users who don’t contribute back. But the point is that the effort here is small, and (some) companies really do provide a lot of support for core Python.

4 Likes

I wasn’t calling out every company, of course some provide considerable support (including dedicated time from their employees). Just making the point you spelled out–the “dead batteries” modules typically had no one with the time to maintain them, and if they’re important for corporate users those companies have resources for that.

The point here is that the maintenance burden for stable modules is very small. The net win Python gets by keeping them in the stdlib, versus requiring users to download third-party packages from PyPI, is much higher than the maintenance effort.

That said, we do need more of those big companies providing additional funding to the core dev team. Their supply chain security budget should easily make it possible to fund at least another 3 core dev in residence positions (and that’s a good investment, one that the SEC will also approve of).

4 Likes

This is true, and yet it’s also true that virtually nobody gets by these days with just the stdlib. If they haven’t got a system for installing third-party libraries, they’ll have an internal first-party library with all their helper code, or they’ll be getting it from a distributor (potentially as part of a larger application that ought to know already that its users are more likely to use certain modules).

Early communication and clearly indicating that the license allows anyone to just copy it into their codebase really ought to be enough (and we’ve done both of these). This is FOSS, we can’t install everything for everyone.

4 Likes

For larger applications, I agree, but for simple scripts, Python’s stdlib often is just fine.

Also note that we are discussing the removal of stdlib modules, not the addition of new modules. That is: we are intentionally breaking simple scripts which have worked perfectly fine before, and due to the nature of such scripts, the breakage will usually only be found when you actually run them, since they often don’t come with tests or a CI and release process that would uncover the problem early.

Of course there are ways around all this like copying code and installing a venv just to run the script, adding things to local utility packages, etc. but they all require extra layers of process to get approved in larger companies (again, rightly so).

2 Likes

Very much this (in spite of PEP 723 making it easier to manage scripts with dependencies).

A comprehensive stdlib is, IMO, still crucial for the “Python as a powerful scripting language” use case.

5 Likes

What people also tend to ignore or overlook is that stdlib modules undergo much stricter testing than most third-party packages. C extension modules in the stdlib are tested on multiple platforms (all the buildbots, stable or unstable) and for reference leaks. Some of them are even fuzzed for robustness in the face of invalid or malicious input.

It’s easy to fool yourself into thinking that third-party packages are well-tested, but realistically very few of them reach the same level of quality assurance.

4 Likes

But not always…

Exhibit A: PR to remove the chunk module:

The module has no tests.

11 Likes

Hello everyone,

In a previous post, I said I’d refrain from discussing this specific project in this thread due to its focus on PEP594. However, with the conversation increasingly centering on the chunk module, I think it’s an apt moment for an update.

Exciting News on the Chunk Module:

  • Test Implementation: I’m thrilled to announce that the chunk module now includes tests. It’s currently a single test, but as someone new to writing publicly accessible tests, I’m looking forward to any feedback.
  • Upcoming Plans:
    1. Integration with the AIFC Module: The immediate plan involves incorporating the aifc module and developing tests for it, with subsequent updates pushed to the PyPI package.
    2. Wave Module Enhancement: We’ll take similar steps for the Wave module, focusing on updates and PyPI deployment.
    3. Code Optimization: I’m concentrating on eliminating redundant code in the aifc and wave modules, especially the internal versions of the Chunk class.
    4. Enhancing Chunk Functionalities: Next, we’ll port over aifc functionalities that support writing data to chunk.

Future Improvements Under Consideration:

  • Implementing @encukou’s suggestion for the “checksum” and “swap ID and size” functionality.
  • Enabling the modules to accept pathlib.Path objects in addition to strings.
  • Adding capabilities for reading/writing data in memory to facilitate chunk modifications without overwriting existing data.
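As a rough idea of how the last two items could fit together, the sketch below shows one common pattern for accepting a str, a pathlib.Path, or an already-open (possibly in-memory) binary stream. This is not code from the project; the open_binary helper is a hypothetical name used only for illustration.

```python
import io
import os
from contextlib import contextmanager


@contextmanager
def open_binary(source):
    """Yield a readable binary stream for a path or an existing stream.

    Paths (str, bytes, or any os.PathLike such as pathlib.Path) are opened
    and closed here; anything else is assumed to be a file-like object that
    the caller owns, so it is passed through and left open.
    """
    if isinstance(source, (str, bytes, os.PathLike)):
        with open(os.fspath(source), "rb") as stream:
            yield stream
    else:
        yield source


if __name__ == "__main__":
    # An in-memory buffer works the same way as a file on disk.
    with open_binary(io.BytesIO(b"FORM")) as stream:
        print(stream.read(4))
```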

This roadmap is ambitious, and while I welcome suggestions and adjustments, I must note that this project isn’t my primary responsibility. Progress might be slow, but it’s steady.

Going forward, I will be posting updates and engaging in discussions on this project in a dedicated thread to keep this one focused on PEP594. I appreciate your understanding and encourage those interested to join the conversation there.

Looking forward to evolving this project with your feedback and insights!

My sincerest thanks,
Jscuster

P.S. For more detailed information and future discussions on this project, please refer to the dedicated thread: Recharging Chunk-Related Dead Batteries - PEP594 Feedback Appreciated.

11 Likes

Thank you!