Is the `codecs` module still useful?

(I don’t have a proposal here - I just want to start a discussion.)

I can remember many years ago - if I had to guess, probably somewhere around the release of 3.4 - trying to use `` in 2.7 to build myself a wrapper, to backport the way that 3.x’s new open worked. I don’t think I ever figured it out - at any rate, the base just isn’t quite right. The problem is that the codecs require the file to be opened in binary mode (despite all the weird bytes↔str implicit conversions 2.x did), so you don’t get universal newline support. If you have a file with \r\n line endings then you just get an extra \r at the end of each line you read. But if you have a file with \r line endings then it doesn’t properly iterate over lines at all.

Fast forward to today, and `` still has the same quirk. Meanwhile, I see about 160k uses of it with GitHub’s code search, of which:

  • about 135k explicitly specify a UTF-8 encoding
  • about 1k specify an ISO-something-or-other encoding (most of these are probably iso-8859-1, i.e. Latin-1)
  • a few hundred specify a code-page encoding (unsurprisingly, the popular ones are 1252, 1251, and 437; surprisingly, I couldn’t find 65001 in this context at all)
  • almost everything else does something more complicated, but is almost certainly selecting a text encoding.

When I tried searching for any of the byte-transform encodings I got barely any results, and most of them were false positives anyway. For example, I found one legitimate example using the bz2 encoding (and then the resulting bytes get passed to BeautifulSoup which does its usual coercion trick).

In short: overwhelmingly, people are using it for something that can trivially be converted to use the builtin open instead. Perhaps this mostly happens for 2.7 compatibility, but such code hides a flaw that doesn’t seem to be well recognized.

I’m not surprised that the binary-transform codecs are rarely used, actually, given that they represent functionality available elsewhere in the standard library. Reaching for the base64 standard library module has always seemed a lot more obvious to me than going through the codec machinery, for example.

As for the text-transform codecs - well, there’s only one, and I genuinely can’t understand why rot13 is provided as a codec. It doesn’t really demonstrate anything about how to use the codec system (since most of the codecs convert between text and bytes, while rot13 maps text to text). It can’t be used as an encoding parameter for open, nor with str.encode or bytes.decode; worse yet, the error message from bytes.decode suggests using codecs.decode, which also doesn’t work.
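For the record, the only spellings that do work are the module-level functions on str (a quick check; the exact error message may vary by version):

```python
import codecs

# The module-level functions accept the text-to-text transform:
print(codecs.encode('hello', 'rot13'))  # 'uryyb'
print(codecs.decode('uryyb', 'rot13'))  # 'hello'

# But the method form refuses it outright:
try:
    'hello'.encode('rot13')
except LookupError as exc:
    print(exc)  # "'rot13' is not a text encoding; ..."
```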

Even the one famous example of Python using rot13 doesn’t use the codec. Yet apparently enough people complained for it to be added back in 3.2 along with the binary transforms. It seems weird as the odd-one-out bit of string-or-bytes utility manipulation that isn’t directly provided another way. And wrapping it up as a codec doesn’t make any sense to me: the underlying encoder and decoder are duplicates of each other, and the other parts of the codec interface seem to be completely unusable, e.g.:

>>> codecs.lookup('rot13').streamreader(io.StringIO('foo')).read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<frozen codecs>", line 503, in read
TypeError: can't concat str to bytes

What about other possible uses for codecs?

  • I touched on codecs.encode and codecs.decode already. These do get used a fair bit (10k+ each), and a fair fraction of them are for binary transforms. Oddly enough, ROT-13 shows up a bit too - mainly, it seems, to obfuscate email addresses. (Users of ROT-13 don’t seem to agree on which direction is “encoding” and which is “decoding”.) But again, this just doesn’t seem like the most pleasant interface (although perhaps it’s better than remembering names like unhexlify and b64decode).

  • The other thing I want to mention is codecs.register. The interface here is needlessly awkward. Even when people go to the trouble of making custom codecs, overwhelmingly they just want to map one specific name to one specific codec. Having to write a function for this (and have it return the codec for one specific string input and None otherwise) is overkill, and there clearly are multiple different ways people go about making that wrapper function. And anyway, it seems like the large majority of uses here are just to alias existing codecs - especially to back-port the aliasing of cp65001 to utf-8 for <=3.7, or to accommodate non-Windows platforms by aliasing mbcs to something else.
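For comparison, the codecs.encode/codecs.decode spellings and the dedicated-module spellings side by side (a sketch; note the codec version appends a newline):

```python
import base64
import codecs

data = b'hello'

# Via the codec machinery:
print(codecs.encode(data, 'base64'))           # b'aGVsbG8=\n'
print(codecs.decode(b'aGVsbG8=\n', 'base64'))  # b'hello'

# Via the dedicated module:
print(base64.b64encode(data))            # b'aGVsbG8='
print(base64.b64decode(b'aGVsbG8='))     # b'hello'
```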
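And a minimal sketch of what that codecs.register wrapper function ends up looking like (the alias name here is made up):

```python
import codecs

def _search(name):
    # codecs.lookup lowercases the requested name before calling each
    # registered search function; we must return None for any name
    # we don't own, or we'd shadow other codecs.
    if name == 'my_utf8':
        return codecs.lookup('utf-8')
    return None

codecs.register(_search)

print('héllo'.encode('my_utf8'))  # b'h\xc3\xa9llo'
```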

Personally what I don’t like about the codecs interface is that I can’t chain them together. That would have seemed like the point of offering those binary transforms: so they could wrap a stream and be composed together and do lazy processing of something that had been compressed or encoded multiple ways. But it’s really just nowhere near as convenient as you’d hope.
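Today, “chaining” has to be spelled as nested, eager calls - every layer makes a full pass over the data. A sketch (zlib + base64 chosen arbitrarily):

```python
import codecs

payload = b'hello world' * 10

# Compress, then base64 - two complete passes, no streaming:
wrapped = codecs.encode(codecs.encode(payload, 'zlib'), 'base64')

# Unwrap in the opposite order:
unwrapped = codecs.decode(codecs.decode(wrapped, 'base64'), 'zlib')
print(unwrapped == payload)  # True
```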

Bonus: maddeningly, the hex codec is defined such that “decoding” (reading from the file) is converting from hex digits (in bytes!) to raw bytes. So you can’t even use it to read an arbitrary binary file and make a hex dump, which would seem like the most obvious use case.
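I.e. the directions are (a quick check):

```python
import codecs

# 'encode' goes raw bytes -> hex digits:
print(codecs.encode(b'\x00\xff', 'hex'))  # b'00ff'
# 'decode' goes hex digits -> raw bytes:
print(codecs.decode(b'00ff', 'hex'))      # b'\x00\xff'
```

So “reading” an arbitrary binary file through the codec fails unless the file already contains hex digits.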

Earlier today I used codecs.getwriter, so, yes, it’s still useful.

A lot of web pages claim to be utf-8 but are in fact cp1251 etc.
The codecs are needed to turn such a page into unicode.

Legacy documents are often in a non-unicode encoding and the codecs module is required to read them.

The built-in open function often causes a gbk-related encode/decode error when reading a UTF-8 text file which contains Chinese characters. `` works without error.

Why doesn’t it work to just specify an encoding parameter for open or for .decode/.encode methods? I’m not trying to question the usefulness of the actual codecs, but of the codecs module - i.e. the specific interface it provides.

Interesting; I’d be happy if you could show a minimal reproducible example. (If it’s necessary to create a binary file with exact byte values, showing a hex dump is fine.)

All the cases I know of do not need the codecs module, they need the encoding to work in open() and encode() and decode(), as you have guessed correctly.

FTR, after reading your post, I did a quick search and I found an alternative to my use of getwriter that doesn’t use the codecs module.

Could you post that please. I did a quick search of my stuff and the
only use I found was this:

if getattr(main_log, 'encoding', None) is None:
    main_log = codecs.getwriter("utf-8")(main_log)

where main_log is an open file-like object. I don’t even remember how
I ended up with a file-like object with no .encoding attribute, and it
may be something which never happens these days.
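For context, codecs.getwriter wraps a binary stream so you can write str to it; a minimal sketch with an in-memory stream standing in for the log file:

```python
import codecs
import io

raw = io.BytesIO()                       # stands in for a binary log file
writer = codecs.getwriter('utf-8')(raw)  # accepts str, encodes on write
writer.write('héllo')
print(raw.getvalue())  # b'h\xc3\xa9llo'
```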

I do a lot of coding in EditPad Pro which can run code and capture stdout, but Python thinks the encoding is cp1252 for some reason, so the workaround I’ve used for years, when needed, is:

import codecs
import sys
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.detach())

Now I see that I can replace that with (no codecs needed, 3.7+):

import sys
sys.stdout.reconfigure(encoding='utf-8')

Some ideas are piecing themselves together in my head for a possibly more pleasant and useful interface for this kind of functionality. Thinking I’ll try to make a PyPI package; hopefully I can name it something more inspired than codecs2

Because that is the Windows locale for the EditPad Pro process, I would assume.
FYI, that is why web pages served from Windows are often in cp1252. The programmer uses unicode in the code and puts utf-8 as the encoding into the HTML. But Windows (.NET) converts from utf-8 to cp1252 when outputting. What the programmer did not know to do was tell .NET that output must be in utf-8, not the default locale.


When writing the example for you, I found my previous statement is not quite correct. The built-in open function often causes a gbk-related encode/decode error when the encoding argument is not specified; when I pass it “utf-8”, there is no error any more. I think the error is caused by the system encoding.

Example: given that temp.txt contains Chinese characters,

with open('temp.txt', 'r', encoding='utf-8') as file:

works without error, but

with open('temp.txt', 'r') as file:

raises a ‘gbk’-related error.

With the encoding specified, there is no difference between open and `` in this case…

Yes. It’s because the default is not utf-8, but whatever the system encoding is. If you have a utf-8 file, but your system is configured to use gbk, then it will fail with such an error (because the utf-8 data is not valid gbk data).
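The mismatch is easy to reproduce without any file at all (the sample text is arbitrary):

```python
text = '你好'                 # arbitrary Chinese sample
data = text.encode('utf-8')  # the bytes actually stored in the file

# The right codec round-trips cleanly:
assert data.decode('utf-8') == text

# A mismatched codec either raises UnicodeDecodeError or silently
# produces mojibake, depending on the exact byte sequence:
try:
    print(data.decode('gbk'))
except UnicodeDecodeError as exc:
    print('decode failed:', exc)
```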

I found something interesting.

>>> with'hax', encoding='utf-8') as f: print(list(f))
['foo\r', 'bar\r', 'baz']
>>> with open('hax', 'rb') as f: print([l.decode('utf-8') for l in f])
['foo\rbar\rbaz']
>>> with open('hax', encoding='utf-8') as f: print(list(f))
['foo\n', 'bar\n', 'baz']

``, with an encoding specified, opens the file in binary mode. It then has some different logic: it doesn’t translate newlines (the way that “universal newline” mode does for files opened in text mode in 3.x); but it does recognize them (which is different from opening the file in binary mode “manually”, which can only recognize b'\n' as a “line ending”). It seems to use the string splitlines method for this.

So I think I got part of the first post wrong. Unless it’s changed since then :wink: