Is the `codecs` module still useful?

(I don’t have a proposal here - I just want to start a discussion.)

I can remember many years ago - if I had to guess, probably somewhere around the release of 3.4 - trying to use codecs.open in 2.7 to build myself a wrapper that backported the way 3.x’s new open worked. I don’t think I ever figured it out; at any rate, the base codecs.open just isn’t quite right. The problem is that the codecs require the file to be opened in binary mode (despite all the weird implicit bytes/str conversions 2.x did), so you don’t get universal newline support. If you have a file with \r\n line endings then you just get an extra \r at the end of each line you read. But if you have a file with \r line endings then it doesn’t properly iterate over lines at all.

Fast forward to today, and codecs.open has the same quirk. Meanwhile, I see about 160k uses of it with GitHub’s code search, of which:

  • about 135k explicitly specify a UTF-8 encoding
  • about 1k specify an ISO-something-or-other encoding (most of these are probably iso-8859-1, i.e. Latin-1)
  • a few hundred specify a code-page encoding (unsurprisingly, the popular ones are 1252, 1251, and 437; surprisingly, I couldn’t find 65001 in this context at all)
  • almost everything else does something more complicated, but is almost certainly selecting a text encoding.

When I tried searching for any of the byte-transform encodings I got barely any results, and most of them were false positives anyway. For example, I found one legitimate example using the bz2 encoding (and then the resulting bytes get passed to BeautifulSoup which does its usual coercion trick).

In short: overwhelmingly, people are using it for something that can trivially be converted to use the builtin open instead. Perhaps this mostly happens for 2.7 compatibility, but such code hides a flaw that doesn’t seem to be well recognized.
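
For instance, the most common pattern converts one-to-one (a minimal sketch; the filename and encoding are illustrative):

import codecs

# Old 2.x-compatible spelling: binary mode underneath, no newline translation
with codecs.open('data.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Near-equivalent with the built-in open, which also handles universal newlines
with open('data.txt', 'r', encoding='utf-8') as f:
    text = f.read()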

I’m not surprised that the binary-transform codecs are rarely-used, actually, given that they represent functionality available elsewhere in the standard library. Reaching for the base64 standard library module has always seemed a lot more obvious to me than codecs.open, for example.
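
For instance, a base64 round-trip is at least as easy with the dedicated module as with the codec (a small sketch; note that the codec path tacks a trailing newline onto its output):

import base64
import codecs

raw = b'hello'
assert base64.b64encode(raw) == b'aGVsbG8='
assert codecs.encode(raw, 'base64') == b'aGVsbG8=\n'   # the codec appends a newline
assert codecs.decode(b'aGVsbG8=\n', 'base64') == raw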

As for the text-transform codecs - well, there’s only one, and I genuinely can’t understand why rot13 is provided as a codec. It doesn’t really demonstrate anything about how to use the codec system (unlike most codecs, it converts text to text rather than between text and bytes). It can’t be used as an encoding parameter for open, nor passed to str.encode or bytes.decode; worse yet, the error from bytes.decode suggests using codecs.decode, which also doesn’t work on bytes here.
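
A quick interactive check of those dead ends (results summarized in the comments rather than full tracebacks):

import codecs

codecs.encode('uryyb', 'rot13')      # the one spelling that works on str: returns 'hello'
# 'hello'.encode('rot13')            # LookupError: 'rot13' is not a text encoding
# b'uryyb'.decode('rot13')           # LookupError, pointing you at codecs.decode()
# codecs.decode(b'uryyb', 'rot13')   # also fails: the codec only handles str, not bytes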

Even the one famous example of Python using rot-13 doesn’t use the codec. Yet apparently enough people complained for it to be added back in 3.2 along with the binary transforms. It’s the odd one out: a bit of string-and/or-bytes manipulation that isn’t directly provided any other way. And wrapping it up as a codec doesn’t make any sense to me: the underlying encoder and decoder are duplicates of each other, and the other parts of the codec interface seem to be completely unusable, e.g.:

>>> codecs.lookup('rot13').streamreader(io.StringIO('foo')).read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<frozen codecs>", line 503, in read
TypeError: can't concat str to bytes

What about other possible uses for codecs?

  • I touched on codecs.encode and codecs.decode already. These do get used a fair bit (10k+ each), and a fair fraction of them are for binary transforms. Oddly enough, ROT-13 shows up a bit too - mainly, it seems, to obfuscate email addresses in setup.py. (Users of ROT-13 don’t seem to agree on which direction is “encoding” and which is “decoding”.) But again, this just doesn’t seem like the most pleasant interface (although perhaps it’s better than remembering names like unhexlify and b64decode).

  • The other thing I want to mention is codecs.register. The interface here is needlessly awkward. Even when people go to the trouble of making custom codecs, overwhelmingly they just want to map one specific name to one specific codec. Having to write a function for this (and have it return the codec for one specific string input and None otherwise) is overkill, and there clearly are multiple different ways people go about making that wrapper function. And anyway, it seems like the large majority of uses here are just to alias existing codecs - especially to back-port the aliasing of cp65001 to utf-8 for <=3.7, or to accommodate non-Windows platforms by aliasing mbcs to something else.
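
The typical wrapper looks something like this (a sketch of the common aliasing pattern; the alias name is just the cp65001 backport case mentioned above):

import codecs

def _search(name):
    # codecs.register wants a search function: return a CodecInfo for the one
    # name we care about, and None for every other lookup
    if name == 'cp65001':
        return codecs.lookup('utf-8')
    return None

codecs.register(_search)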

Personally what I don’t like about the codecs interface is that I can’t chain them together. That would have seemed like the point of offering those binary transforms: so they could wrap a stream and be composed together and do lazy processing of something that had been compressed or encoded multiple ways. But it’s really just nowhere near as convenient as you’d hope.
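
The best I can do is nest whole-buffer calls; there’s nothing lazy or stream-wrapping about it (a sketch using the zlib and base64 transforms on made-up data):

import base64
import codecs
import zlib

payload = base64.b64encode(zlib.compress(b'compressed, then base64-encoded'))

# "Chaining" the transforms just means nesting eager calls over the whole buffer
decoded = codecs.decode(codecs.decode(payload, 'base64'), 'zlib')
assert decoded == b'compressed, then base64-encoded'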

Bonus: maddeningly, the hex codec is defined such that “decoding” (reading from the file) is converting from hex digits (in bytes!) to raw bytes. So you can’t even use it to read an arbitrary binary file and make a hex dump, which would seem like the most obvious use case.
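
Concretely (a sketch; binascii at least puts the direction in the function name):

import binascii
import codecs

assert codecs.decode(b'68656c6c6f', 'hex') == b'hello'    # "decoding" goes hex digits -> raw bytes
assert codecs.encode(b'hello', 'hex') == b'68656c6c6f'    # "encoding" goes raw bytes -> hex digits
assert binascii.unhexlify(b'68656c6c6f') == b'hello'
assert binascii.hexlify(b'hello') == b'68656c6c6f'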

Earlier today I used codecs.getwriter, so, yes, it’s still useful.

A lot of web pages claim to be utf-8 but are in fact cp1251 etc.
The codecs are needed to turn such a page into unicode.

Legacy documents are often in a non-unicode encoding and the codecs module is required to read them.

The built-in open function often raises a gbk-related encode/decode error when reading a UTF-8 text file that contains Chinese characters. codecs.open works without error.

Why doesn’t it work to just specify an encoding parameter for open or for .decode/.encode methods? I’m not trying to question the usefulness of the actual codecs, but of the codecs module - i.e. the specific interface it provides.

Interesting; I’d be happy if you could show a minimal reproducible example. (If it’s necessary to create a binary file with exact byte values, showing a hex dump is fine.)

All the cases I know of do not need the codecs module, they need the encoding to work in open() and encode() and decode(), as you have guessed correctly.

FTR, after reading your post, I did a quick search and I found an alternative to my use of getwriter that doesn’t use the codecs module.

Could you post that, please? I did a quick search of my stuff and the only use I found was this:

if getattr(main_log, 'encoding', None) is None:
    main_log = codecs.getwriter("utf-8")(main_log)

where main_log is an open file-like object. I don’t even remember how I ended up with a file-like object with no .encoding attribute, and it may be something which never happens these days.

I do a lot of coding in EditPad Pro which can run code and capture stdout, but Python thinks the encoding is cp1252 for some reason, so the workaround I’ve used for years, when needed, is:

import codecs
import sys
# detach() returns the underlying binary stream; rewrap it with a UTF-8 writer
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.detach())

Now I see that I can replace that with:

import sys
sys.stdout.reconfigure(encoding='utf-8')

Some ideas are piecing themselves together in my head for a possibly more pleasant and useful interface for this kind of functionality. I’m thinking I’ll try to make a PyPI package; hopefully I can name it something more inspired than codecs2.

Because that is the Windows locale for the EditPad Pro process, I would assume.
FYI, that is why web pages served from Windows are often in cp1252. The programmer uses Unicode in the code and puts utf-8 as the encoding in the HTML, but Windows (.NET) converts from utf-8 to cp1252 when writing the output. What the programmer didn’t know to do was tell .NET that the output must be utf-8, not the default locale encoding.


While putting the example together for you, I found that my previous statement was not quite correct. The built-in open function often raises a gbk-related encode/decode error when the encoding argument is not specified; when I pass it “utf-8”, there is no error any more. I think the error is caused by the system encoding.

For example, given that temp.txt contains Chinese characters,

with open('temp.txt', 'r', encoding='utf-8') as file:
    file.read()

works.

with open('temp.txt', 'r') as file:
    file.read()

raises a ‘gbk’-related error.

With the encoding specified, there is no difference between open and codecs.open in this case…

Yes. It’s because the default is not utf-8, but whatever the system encoding is. If you have a utf-8 file, but your system is configured to use gbk, then it will fail with such an error (because the utf-8 data is not valid gbk data).
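
You can see what open() will pick up when no encoding argument is given (a quick check; the values in the comment are just typical examples):

import locale

# This is what open() uses when no encoding is passed: e.g. 'cp936' (GBK) on
# Chinese-language Windows, usually 'utf-8' on most Linux systems
print(locale.getpreferredencoding(False))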

I found something interesting.

>>> with codecs.open('hax', encoding='utf-8') as f: print(list(f))
... 
['foo\r', 'bar\r', 'baz']
>>> with open('hax', 'rb') as f: print([l.decode('utf-8') for l in f])
... 
['foo\rbar\rbaz']
>>> with open('hax', encoding='utf-8') as f: print(list(f))
... 
['foo\n', 'bar\n', 'baz']

codecs.open, with an encoding specified, opens the file in binary mode. It then has some different logic: it doesn’t translate newlines (the way that “universal newline” mode does for files opened in text mode in 3.x), but it does recognize them (which is different from opening the file in binary mode “manually”, which can only recognize b'\n' as a “line ending”). It seems to use the string splitlines method for this.
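
str.splitlines recognizes \r and \r\n (and a handful of Unicode line boundaries) without translating them, which matches the codecs.open output above:

>>> 'foo\rbar\r\nbaz\nqux'.splitlines(keepends=True)
['foo\r', 'bar\r\n', 'baz\n', 'qux']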

So I think I got part of the first post wrong. Unless it’s changed since then :wink: