Using UTF-8 for cookie.txt

Currently, http.cookiejar uses the locale encoding (issue).
I want to stop using the locale encoding here.

By RFC 6265, cookie names and values must be ASCII.
But cookie text files may contain comments that are not required to be ASCII.

I googled and found this issue at youtube-dl.

I don’t know what tool this user used to create their cookie.txt,
so I am not sure what the non-ASCII part actually was.

This issue was reported in 2018, and youtube-dl added support for UTF-8 cookie files in 2020.

What do you think about using UTF-8 in http.cookiejar, as youtube-dl does?

The closest I could find to spec for the Netscape cookie file format is this curl document:

https://curl.se/docs/http-cookies.html

The purpose of the http.cookiejar module is to read and write to such a text file (in various formats), so I’d opt for reading the file as Latin-1 to avoid any encoding issues and then remove any comments. The rest must then be plain ASCII as per the RFCs or an error is raised. When writing the file, plain ASCII should be used.

Using UTF-8 won’t really help us in this case.
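
The Latin-1 approach described above could look roughly like the following. This is only a minimal sketch: the helper name `read_cookie_lines` is hypothetical and not part of http.cookiejar, and treating every line that starts with `#` as a comment is a simplifying assumption.

```python
import os
import tempfile


def read_cookie_lines(path):
    """Return the non-comment, non-blank lines of a cookie file.

    Hypothetical helper: raises ValueError if a data line is not ASCII.
    """
    data_lines = []
    # Latin-1 maps every byte to a code point, so decoding never fails.
    with open(path, encoding="latin-1") as f:
        for line in f:
            line = line.rstrip("\n")
            # Skip blank lines and comments (simplified comment rule).
            if not line.strip(" \t") or line.startswith("#"):
                continue
            if not line.isascii():
                raise ValueError(f"non-ASCII cookie data: {line!r}")
            data_lines.append(line)
    return data_lines


# Demo: a comment with a non-ASCII byte followed by one ASCII data line.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "cookies.txt")
    with open(demo_path, "wb") as out:
        out.write(b"# caf\xe9 comment\n")
        out.write(b".example.com\tTRUE\t/\tFALSE\t0\tname\tvalue\n")
    demo_lines = read_cookie_lines(demo_path)

print(demo_lines)
```

The comment is decoded without error and then dropped, so only the data line has to pass the ASCII check.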

If the default is changed, please add an encoding parameter that lets the caller select the encoding, so that a cookiejar can still be loaded with the same encoding as in Python 3.10: encoding=sys.getfilesystemencoding().

If you want to force the caller to choose, ASCII encoding would be a good default. But honestly, it’s super annoying to have to choose the encoding.

RFCs and other standards are nice, but “in the wild”, people do random things which don’t respect them, like sending HTTP headers in UTF-8 rather than Latin-1. IMO switching cookiejar to UTF-8 by default is more backward compatible than using Latin-1, since most operating systems use UTF-8 as the locale encoding (it’s mostly Windows which uses other 8-bit encodings like cp1252, no?).

The point is that all actual data in a cookie file has to be ASCII according to the RFCs. It’s only comments which can break this.

Also note that the files can potentially be read and written by other applications, which may have different ideas about encoding.

Since Python’s cookiejar module is only interested in the actual data, the comments are irrelevant, so by making it read any ASCII compatible encoding and making sure only ASCII content is written, we should get the best compatibility with external tools.

Thank you for the comments.
Now I think Latin-1 is the perfect encoding for cookiejar.

  • For cookie data, Latin-1 is byte-transparent between reading and writing the cookie file.
    • WSGI uses Latin-1 to encode and decode HTTP headers, so it is also byte-transparent between the HTTP request/response and the cookie file.
  • For reading comments, Latin-1 won’t cause UnicodeDecodeError.
    • CookieJar just ignores comments, so there is no need to worry about mojibake.
  • For writing comments, CookieJar doesn’t support writing user comments to cookie.txt, so there is nothing to take care of.

In Latin-1, the bytes \x85 and \xa0 are whitespace. Code that uses str.strip() or Unicode regular expressions with \s handles them incorrectly.
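
A small illustration of that pitfall, using a made-up byte string rather than a real cookie file:

```python
import re

raw = b"\x85value\xa0"        # made-up bytes, for illustration only
text = raw.decode("latin-1")  # becomes U+0085 + "value" + U+00A0

# str.strip() with no argument removes all Unicode whitespace,
# which includes U+0085 (NEL) and U+00A0 (no-break space).
stripped = text.strip()

# In str patterns, \s also matches both characters.
ws_matches = re.findall(r"\s", text)

# Stripping only an explicit ASCII set leaves the value intact.
ascii_stripped = text.strip(" \t\r\n")

print(repr(stripped), ws_matches, repr(ascii_stripped))
```

So the two bytes are silently dropped from the value unless the stripping is restricted to ASCII whitespace.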

Rather than choosing an encoding that just happens to map all byte values, why not use a different error handler? If non-ASCII is only allowed in comments, then it won’t matter if they are skipped/replaced/escaped, provided they just don’t raise.
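
As a rough sketch of that idea, here is how a few of the standard codec error handlers treat a comment line that contains non-ASCII bytes. The comment text is invented for the example.

```python
# A comment line with non-ASCII (UTF-8 encoded) bytes; invented example.
comment = "# créé par un outil\n".encode("utf-8")

# None of these handlers raises; they only differ in what they leave behind.
decoded = {
    handler: comment.decode("ascii", errors=handler)
    for handler in ("ignore", "replace", "backslashreplace")
}

# The default handler ("strict") is the one that raises.
try:
    comment.decode("ascii")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

for handler, text in decoded.items():
    print(handler, repr(text))
```

Since the comment is discarded anyway, which of the non-raising handlers is used makes no difference to the parsed cookies.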

Another option is to open the file in binary mode and only convert to string after stripping comments.
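
A minimal sketch of the binary-mode idea, assuming a hypothetical helper `parse_cookie_file` that takes a file object opened in binary mode and, as a simplification, treats every line starting with `#` as a comment:

```python
import io


def parse_cookie_file(binary_file):
    """Return tab-separated fields of each data line (hypothetical helper)."""
    rows = []
    for raw_line in binary_file:
        raw_line = raw_line.rstrip(b"\r\n")
        # Comments and blank lines are dropped while still in bytes,
        # so their encoding never matters.
        if not raw_line.strip(b" \t") or raw_line.startswith(b"#"):
            continue
        # Only data lines are decoded; non-ASCII data raises UnicodeDecodeError.
        rows.append(raw_line.decode("ascii").split("\t"))
    return rows


demo = io.BytesIO(
    b"# caf\xc3\xa9\n"
    b"\n"
    b".example.com\tTRUE\t/\tFALSE\t0\tsid\tabc123\r\n"
)
rows = parse_cookie_file(demo)
print(rows)
```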

Good suggestions, Steve and Ronald.

There certainly are multiple ways to achieve the same outcome: read the raw file in some way, remove the comments, process the rest as ASCII, fail if the rest is not ASCII.

RFC 6265 says:

 cookie-value      = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
 cookie-octet      = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
                       ; US-ASCII characters excluding CTLs,
                       ; whitespace DQUOTE, comma, semicolon,
                       ; and backslash

But it is possible to write non-ASCII characters in the header.

We need to consider the balance between backward compatibility, compatibility with other tools, and security. But I am not an expert on the many HTTP tools out there.

I took a quick look at what Go does. It ignores cookie values that are not valid cookie-octets.

Maybe being strict is better for security.
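
For illustration, a strict check against the cookie-value grammar quoted above can be written as a regular expression. This is a sketch only; `is_valid_cookie_value` is a hypothetical name, not an http.cookiejar function, and it is not a port of Go’s implementation.

```python
import re

# cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
_COOKIE_OCTETS = r"[\x21\x23-\x2B\x2D-\x3A\x3C-\x5B\x5D-\x7E]*"

# cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
COOKIE_VALUE_RE = re.compile(rf'(?:{_COOKIE_OCTETS}|"{_COOKIE_OCTETS}")')


def is_valid_cookie_value(value: str) -> bool:
    """Return True if value matches the RFC 6265 cookie-value grammar."""
    return COOKIE_VALUE_RE.fullmatch(value) is not None


for candidate in ["abc123", '"abc"', "a b", "a;b", "caf\xe9"]:
    print(repr(candidate), is_valid_cookie_value(candidate))
```

A strict reader could skip, or reject, any cookie whose value fails this check.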

Opening the file in binary mode is not backward compatible. See the source code.

FileCookieJar.load() opens the file and passes it to self._really_load().
Subclasses implement _really_load().

Some third-party tools may implement a subclass of FileCookieJar that overrides only _really_load().
So we need to open the file in text mode, not binary mode.
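
To make the compatibility concern concrete, here is a sketch of such a subclass. `LineRecordingJar` is a made-up class that overrides only the private `_really_load()` hook and therefore relies on `load()` handing it a text-mode file.

```python
import http.cookiejar
import os
import tempfile


class LineRecordingJar(http.cookiejar.FileCookieJar):
    """Made-up third-party subclass that overrides only _really_load()."""

    def _really_load(self, f, filename, ignore_discard, ignore_expires):
        # Code like this expects str lines; a binary file would yield bytes.
        self.seen = [line for line in f]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cookies.txt")
    with open(path, "w", encoding="ascii") as out:
        out.write("# Netscape HTTP Cookie File\n")
    jar = LineRecordingJar()
    jar.load(path)  # FileCookieJar.load() opens the file and calls the hook

print(jar.seen)
```

If `load()` started passing a binary file, the lines seen by such a subclass would change from `str` to `bytes`.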

That’s too bad. That probably makes Steve’s suggestion of using a different error handler the better option. The file could be opened with the ascii encoding and the surrogateescape error handler. That makes it possible to parse all files and still recognise non-ASCII values, without incorrectly treating some of them as whitespace.
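
A short sketch of how that behaves, using made-up file contents. With surrogateescape, each non-ASCII byte becomes a lone surrogate, which is not whitespace and can be encoded back to the original byte.

```python
import os
import tempfile

# Made-up contents: a non-ASCII comment, and a value wrapped in \x85 and \xa0.
raw = b"# comment \xe9\nname\t\x85value\xa0\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cookies.txt")
    with open(path, "wb") as out:
        out.write(raw)
    # Text mode, so existing _really_load() overrides keep receiving str.
    with open(path, encoding="ascii", errors="surrogateescape") as f:
        lines = f.read().split("\n")

value = lines[1].split("\t")[1]

stripped = value.strip()            # surrogates are not whitespace
has_non_ascii = not value.isascii() # non-ASCII data is still detectable
round_trip = value.encode("ascii", errors="surrogateescape")

print(repr(value), has_non_ascii, round_trip)
```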
