PEP 594, take 2: Removing dead batteries from the standard library

If the uu module and uu_codec are being removed, should binascii.a2b_uu and binascii.b2a_uu also go? The PEP references binascii, but it doesn’t mention those functions.

If uuencoding support is being removed, then all support should be removed. Note that the uu_codec in the encodings package uses binascii for the actual work.
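
To make that relationship concrete, here is a minimal sketch, assuming an interpreter that still ships both the uu codec and the binascii uuencoding functions. The sample payload is arbitrary.

```python
import binascii
import codecs

data = b"Cat"

# binascii produces a single uuencoded line: a length character, then the payload.
line = binascii.b2a_uu(data)
print(line)  # b'#0V%T\n'

# The bytes-to-bytes "uu" codec adds begin/end framing around such lines.
framed = codecs.encode(data, "uu")
print(framed)

# Both layers decode back to the original bytes.
assert binascii.a2b_uu(line) == data
assert codecs.decode(framed, "uu") == data
```

The codec output contains the same line that binascii produced, so removing the binascii functions would also mean rewriting or removing the codec.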

Whether or not to remove support is debatable, I guess. Most tools will use MIME and base64 nowadays. There could still be old data encoded using uuencoding, though. For codecs, we normally keep support around for encodings which are still in use, but the uu codec has been special since Python 3 due to being a bytes-to-bytes codec. It’s probably not used much these days.

Apparently uuencoding is still used in SEC filings. See python/cpython issue #100308 (“`binascii.a2b_uu` incorrectly assumes padded bytes are always whitespace”) for a truly frightening example.

I looked around on PyPI, but could not find a package which could be used as a replacement for the uu codec.

Given that such data is apparently still widespread in rather important databases such as SEC Edgar, this would be a reason to perhaps only deprecate the uu module and leave the codec and binascii support in place.

SEC data goes back as far as 1994, so it’s not surprising that uuencoding was being used. Base64 was only standardized in 2003 (RFC 3548). However, it is a bit surprising that they haven’t updated the blobs to e.g. base64 since then.

Update: Looks like the base64 encoding itself was already formalized as part of MIME in RFC 2045 and first mentioned in RFC 1421, dating back to 1996 and 1993 respectively. See the Wikipedia article on Base64.

If there’s one thing I’ve learned working in the financial industry, it’s that old file formats never go away.

I’d be okay with just getting rid of the uu module and leaving the rest in place.

The binascii functions don’t raise DeprecationWarning in 3.11, so they should not be removed on the PEP 594 schedule. The question is if they should be deprecated in 3.12.
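
One way to check this on a given interpreter is to record warnings around the calls. This is only a sketch; the outcome depends on the Python version it runs under, so no particular count is assumed here.

```python
import binascii
import warnings

with warnings.catch_warnings(record=True) as caught:
    # Show every warning so the default filters cannot hide a deprecation.
    warnings.simplefilter("always")
    encoded = binascii.b2a_uu(b"Cat")
    decoded = binascii.a2b_uu(encoded)

# Keep only deprecation-style warnings triggered by the two calls above.
deprecations = [w for w in caught if issubclass(w.category, DeprecationWarning)]
print(f"DeprecationWarnings raised: {len(deprecations)}")
```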

Looks like they shouldn’t. Either way, it’d be good to update the uu and/or binascii docs.

All I’ve seen explicitly mentioned on this thread and elsewhere as to actual continuing active use of this legacy encoding format is the single >20 year old EDGAR database. Could you share some of the other examples that indicate that its use is still widespread?

While it doesn’t necessarily capture proprietary-only uses, a grep.app search for the binascii (a2b|b2a)_uu functions which searches all of GitHub only found 76 hits, of which all but ≈12 were just vendored copies of the stdlib, and at least half of the remaining were in tests, of stdlib or third-party code.

Also, for reference, a grep.app search for the already-deprecated import uu finds 46 hits (plus 1 for from uu), of which only ≈7 are not vendored copies of the stdlib, and most are test code (and 1 for EDGAR). By comparison, there are around 25 000 + 6000 hits for import base64 + from base64.

At least from what I’ve seen, I’m not sure maintaining first-party support for an ancient, non-standardized and all-but-obsolete format in the standard library forever is a course of action that best serves the core development team, its remaining legacy users or the rest of the Python userbase, as opposed to moving it to a modular PyPI package maintained by folks that actually still use it.

For the Python core team, this avoids the burden of needing to maintain the code in perpetuity, particularly given it has at least one open security vulnerability awaiting a response (python/cpython#99889).

For the likely overwhelming majority of Python users who don’t use it, directly or indirectly, it avoids any doubt about whether they should, and slims down the code and simplifies documentation distributed to every user.

And for those that may still actually need it, this puts the folks most qualified and motivated to develop and maintain it—its remaining active users—in full control of doing so, without having to go through the whole stdlib development and release process, and allowing their users to gain access to fixes and enhancements immediately or at their own pace, independent of a whole new Python version—likely particularly helpful for the legacy systems, applications and use cases still using it.

But that’s not the choice we have here. No-one has offered to maintain a PyPI package for this. So the choice is whether we maintain support for uuencoding in the stdlib, or drop such support with no replacement.

There may well be arguments for dropping without replacement (the obvious one being "if no-one is willing to maintain a 3rd party copy, it can’t be that important", although I consider that argument to be flawed, personally). But we should be clear on what we’re arguing for, and not mistakenly claim that dropping stdlib support leaves whatever existing users there are with a straightforward alternative.

Do we need a straightforward alternative? <10 public projects using a piece of functionality sounds like the support isn’t worth the distraction from the core devs’ focus.

An alternative is to vendor the stdlib’s version (as many projects are apparently already doing). The license seems permissive enough for such a case.

Not to mention the module will still be supported for at least 7 years after any deprecation.

Yes, indeed: the choice to be made by CPython and the core team would be to deprecate and eventually remove the functionality, whereas the potential replacement with a third party package (copied from the stdlib, or otherwise) is an option its remaining users would have to take the lead on exercising, if they are still interested in it, just like any other module deprecated for removal by PEP 594. To note, though, I’d be willing to help point them toward the code in question and advise them on how to do so.

To be clear, the choice to deprecate both uu and the related support code without a stdlib replacement was already made and approved in PEP 594, which states in part:

The uu codec is provided by the binascii module. There’s also encodings/uu_codec.py which is a codec for the same encoding; it should also be deprecated.

Given the extremely small direct usage numbers at least in non-proprietary code, even compared to a number of the other modules that were deprecated for removal in PEP 594, and only one identified ongoing use case thus far, there does not appear at least to me to be a compelling reason to reverse this decision.

However, formally deprecating the remaining ancillary code for at least two full feature versions will provide a strong notification to any remaining users, giving them the opportunity to migrate to an alternative (which I’m personally willing to help support) or report the disruption the change would cause, giving us more compelling data that could justify temporarily or indefinitely deferring the removal.

binascii.a2b_uu is still used in the email package so if we remove it, we’ll have to vendor that functionality into message.py so as not to break backward compatibility. We’ve already done that with sndhdr (in email/mime/audio.py) and imghdr (in email/mime/image.py). AFAICT, email doesn’t use binascii.b2a_uu.
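
For context, a minimal sketch of the kind of input this affects, assuming the legacy `get_payload(decode=True)` path and an `x-uuencode` transfer encoding. The message and the `cat.txt` filename are made up for illustration.

```python
import binascii
from email import message_from_string

# Build a tiny non-MIME-style message whose body is a uuencoded attachment.
uu_line = binascii.b2a_uu(b"Cat").decode("ascii")  # '#0V%T\n'
raw = (
    "Content-Type: application/octet-stream\n"
    "Content-Transfer-Encoding: x-uuencode\n"
    "\n"
    "begin 644 cat.txt\n"
    + uu_line
    + "end\n"
)

msg = message_from_string(raw)

# decode=True asks the email package to undo the transfer encoding.
payload = msg.get_payload(decode=True)
print(payload)
```

Without uuencoding support behind this call, the expectation is that the payload would come back as the still-encoded text rather than the decoded bytes.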

Per my analysis on the PR here and here, it seems fairly likely that only a small number of user packages could potentially be affected (those using a non-default option of a specific function of the legacy email API), and only then in the specific circumstance of processing certain very old non-MIME emails with uuencoded attachments. The impact would be that the attachment is merely left as uuencoded text rather than decoded to binary data, not a hard error or an impact to message content.

Raising a deprecation warning would give us a much better idea of how likely this is to occur in practice, and what the actual impact would be; if it was determined to be significant, then I could potentially help with the vendoring.

The EDGAR database is one of the central SEC databases, so it’s not just any old database. It’s also not >20 years old, but gets updated on a daily basis. Since they have been running the database for >20 years, it is not surprising to find old data and old data formats in that database.

Another large data collection using the encoding is the Usenet archives at the Internet Archive. Uuencoding was used as the main transport for binary data on Usenet in the early days. Even though those archives are stored as mbox files, they are still full of uuencoded parts. You can even find it used in postings as recent as 2011 (e.g. in alt.sources).

So the assumption of the encoding not being in active use anymore is obviously not quite right.

That’s also the reason why we normally don’t remove codecs from Python: there will most likely still be data around which people may want to read in and process using Python, esp. data coming from databases and legacy systems. Maintenance of those codecs usually isn’t hard (encodings don’t change much after they have been standardized), so it’s not a real burden.

And no, a search on GitHub won’t give us more insight into how widespread an encoding is, since it only covers open source code which may want to process such data, but doesn’t give any insight into how much encoded data is still in active use.

This refers to the uu module, not the codec. I’m fine with deprecating and removing the uu module.

Perhaps someone will start a project on PyPI to replace the uu codec. Once there’s such an alternative, we could reconsider deprecating the built-in uu codec and binascii uuencoding code parts.
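
As a rough idea of how small such a replacement could be, here is a pure-Python sketch of the per-line encoding. The function names `uu_encode_line` and `uu_decode_line` are hypothetical, the begin/end framing is left out, and binascii is imported only to cross-check the result where it is still available.

```python
import binascii


def uu_encode_line(chunk: bytes) -> bytes:
    """Encode up to 45 bytes as one uuencoded line (hypothetical helper)."""
    if len(chunk) > 45:
        raise ValueError("at most 45 bytes per line")
    out = bytearray([32 + len(chunk)])  # length character
    padded = chunk + b"\0" * (-len(chunk) % 3)  # zero-pad to a multiple of 3
    for i in range(0, len(padded), 3):
        a, b, c = padded[i:i + 3]
        sextets = (a >> 2, ((a & 3) << 4) | (b >> 4), ((b & 15) << 2) | (c >> 6), c & 63)
        out += bytes(32 + s for s in sextets)
    return bytes(out) + b"\n"


def uu_decode_line(line: bytes) -> bytes:
    """Decode one uuencoded line; accepts both space and backtick for zero."""
    line = line.rstrip(b"\r\n")
    if not line:
        return b""
    length = (line[0] - 32) & 63
    sextets = [(ch - 32) & 63 for ch in line[1:]]
    sextets += [0] * (-len(sextets) % 4)  # tolerate stripped trailing padding
    out = bytearray()
    for i in range(0, len(sextets), 4):
        a, b, c, d = sextets[i:i + 4]
        out += bytes([(a << 2) | (b >> 4), ((b & 15) << 4) | (c >> 2), ((c & 3) << 6) | d])
    return bytes(out[:length])


sample = b"Cat"
assert uu_encode_line(sample) == binascii.b2a_uu(sample)
assert uu_decode_line(uu_encode_line(sample)) == sample
```

A real package would also need the begin/end framing and the tolerance for malformed input that the stdlib code has accumulated, which this sketch does not attempt.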

I agree with @malemburg’s assessment. Consider also that, like Usenet archives, some folks have decades of old email content that may contain uuencoded data … like very likely mail.python.org :smiley:

msilib

The msilib package is a Windows-only package. It supports the creation of Microsoft Installers (MSI). The package also exposes additional APIs to create cabinet files (CAB). The module is used to facilitate distutils to create MSI installers with the bdist_msi command. In the past it was used to create CPython’s official Windows installer, too.

Microsoft is slowly moving away from MSI in favor of Windows 10 Apps (AppX) as a new deployment model [3].

(emphasis mine)

I am interested in this characterization. The (7 year old!) reference given in the PEP says:

The Future of Windows Installer (MSI) …
…MSI isn’t going to go away…
…MSI is going to be around for some while…

which to me seems to contradict the idea that msilib is a ‘dead battery’ in technical terms; if the format is sticking around, it will continue to be useful for whatever tasks people currently do with it [1].

Now, if msilib is considered a dead battery for non-technical reasons, that’s another story. If Python core wants to help encourage people to use AppX (and find support for doing that somewhere else) or if there aren’t enough use(r)s of it relative to its maintenance burden, I totally understand wanting to get rid of it, but the PEP may benefit from saying so in more explicit terms.

Or perhaps, I missed something in my reading of the intent to drop msilib, in which case I hope someone will point it out. :sweat_smile:


[1] An “edit an existing MSI” task is what sent me to this thread

The blog post is 7 years old, so it isn’t really representative of where Microsoft may be taking things.

Maybe, but it isn’t widely useful enough to warrant keeping in the stdlib. We cut plenty of other modules with more users because their use was simply not widespread enough for us to have to shoulder the burden of keeping them functioning.

At this point it isn’t really important as the PEP has been accepted and implemented, making the PEP a historical document.

14 posts were split to a new topic: Maintaining the chunk module after it has been removed from the standard library