Using UTF-8 for cookie.txt

methane · May 6, 2022, 9:04am

Currently, http.cookiejar uses locale encoding (issue).
I want to stop using locale encoding here.

According to RFC 6265, cookie names and values must be ASCII.
But cookie text files may contain comments, which are not required to be ASCII.

I googled and found this issue at youtube-dl.

github.com/ytdl-org/youtube-dl

Not sure if it's a bug in youtube-dl or cookiejar.py GBK codec can't decode

opened 11:43AM - 08 Apr 18 UTC

closed 02:38PM - 08 Apr 18 UTC

HoldOnBro

invalid external-bugs

## Please follow the guide below

- You will be asked some questions and requested to provide some information, please read them **carefully** and answer honestly
- Put an `x` into all the boxes [ ] relevant to your *issue* (like this: `[x]`)
- Use the *Preview* tab to see what your issue will actually look like

---

### Make sure you are using the *latest* version: run `youtube-dl --version` and ensure your version is *2018.04.03*. If it's not, read [this FAQ entry](https://github.com/rg3/youtube-dl/blob/master/README.md#how-do-i-update-youtube-dl) and update. Issues with outdated version will be rejected.
- [x] I've **verified** and **I assure** that I'm running youtube-dl **2018.04.03**

### Before submitting an *issue* make sure you have:
- [x] At least skimmed through the [README](https://github.com/rg3/youtube-dl/blob/master/README.md), **most notably** the [FAQ](https://github.com/rg3/youtube-dl#faq) and [BUGS](https://github.com/rg3/youtube-dl#bugs) sections
- [x] [Searched](https://github.com/rg3/youtube-dl/search?type=Issues) the bugtracker for similar issues including closed ones
- [x] Checked that provided video/audio/playlist URLs (if any) are alive and playable in a browser

### What is the purpose of your *issue*?
- [x] Bug report (encountered problems with youtube-dl)
- [ ] Site support request (request for adding support for a new site)
- [ ] Feature request (request for a new functionality)
- [x] Question
- [ ] Other

---

```
C:\Windows\system32>youtube-dl --proxy "socks5://127.0.0.1:1080/" --cookies "D:/lynda download/cookies.txt" -v --all-subs -o "Lynda.com - %(playlist)s/%(chapter_number)s - %(chapter)s/%(playlist_index)s - %(title)s.%(ext)s" "https://www.lynda.com/Photography-tutorials/Advanced-Photography-Medium-Format-Digital-Cameras/647680-2.html"
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--proxy', 'socks5://127.0.0.1:1080/', '--cookies', 'D:/lynda download/cookies.txt', '-v', '--all-subs', '-o', 'Lynda.com - %(playlist)s/%(chapter_number)s - %(chapter)s/%(playlist_index)s - %(title)s.%(ext)s', 'https://www.lynda.com/Photography-tutorials/Advanced-Photography-Medium-Format-Digital-Cameras/647680-2.html']
c:\users\张心阳\appdata\local\programs\python\python36\lib\http\cookiejar.py:2061: UserWarning: http.cookiejar bug!
Traceback (most recent call last):
  File "c:\users\张心阳\appdata\local\programs\python\python36\lib\http\cookiejar.py", line 2011, in _really_load
    line = f.readline()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 663: illegal multibyte sequence

  _warn_unhandled_exception()
Traceback (most recent call last):
  File "c:\users\张心阳\appdata\local\programs\python\python36\lib\http\cookiejar.py", line 2011, in _really_load
    line = f.readline()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 663: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\张心阳\appdata\local\programs\python\python36\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\张心阳\appdata\local\programs\python\python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\张心阳\AppData\Local\Programs\Python\Python36\Scripts\youtube-dl.exe\__main__.py", line 9, in <module>
  File "c:\users\张心阳\appdata\local\programs\python\python36\lib\site-packages\youtube_dl\__init__.py", line 471, in main
    _real_main(argv)
  File "c:\users\张心阳\appdata\local\programs\python\python36\lib\site-packages\youtube_dl\__init__.py", line 438, in _real_main
    with YoutubeDL(ydl_opts) as ydl:
  File "c:\users\张心阳\appdata\local\programs\python\python36\lib\site-packages\youtube_dl\YoutubeDL.py", line 411, in __init__
    self._setup_opener()
  File "c:\users\张心阳\appdata\local\programs\python\python36\lib\site-packages\youtube_dl\YoutubeDL.py", line 2291, in _setup_opener
    self.cookiejar.load()
  File "c:\users\张心阳\appdata\local\programs\python\python36\lib\http\cookiejar.py", line 1784, in load
    self._really_load(f, filename, ignore_discard, ignore_expires)
  File "c:\users\张心阳\appdata\local\programs\python\python36\lib\http\cookiejar.py", line 2063, in _really_load
    (filename, line))
http.cookiejar.LoadError: invalid Netscape format cookies file 'D:/lynda download/cookies.txt': 'www.telerik.com\tFALSE\t/\tFALSE\t1679055717\tki_t\t1521288091697%3B1521288091697%3B1521289317673%3B1%3B8'
```

this problem only occurs when Im using youtube-dl under python3
Under python2, I can download the full course perfectly
while I use

```
D:\>python3
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys; print(sys.getdefaultencoding())
utf-8
>>>
```

as above, python3 gives me a default encoding as UTF8

I don’t know what tool this user used to create their cookie.txt,
so I am not sure what the non-ASCII part actually was.

This issue was reported in 2018, and youtube-dl added support for UTF-8 cookie files in 2020.

What do you think about using UTF-8 in http.cookiejar, as youtube-dl does?

malemburg · May 6, 2022, 9:56am

The closest thing I could find to a spec for the Netscape cookie file format is this curl document:

https://curl.se/docs/http-cookies.html

The purpose of the http.cookiejar module is to read and write to such a text file (in various formats), so I’d opt for reading the file as Latin-1 to avoid any encoding issues and then remove any comments. The rest must then be plain ASCII as per the RFCs or an error is raised. When writing the file, plain ASCII should be used.
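As a rough illustration of the approach described above (decode as Latin-1, drop comments, require ASCII for the rest), here is a minimal sketch. The function name `read_cookie_lines` is hypothetical and not part of http.cookiejar, and treating every line that starts with `#` as a comment is a simplifying assumption.

```python
def read_cookie_lines(raw: bytes) -> list:
    """Return the non-comment lines of a cookie file, requiring them to be ASCII.

    Hypothetical helper for illustration only; not part of http.cookiejar.
    """
    # Latin-1 maps every byte to a code point, so decoding never raises.
    text = raw.decode("latin-1")
    lines = []
    # Split on "\n" only: str.splitlines() would also split on "\x85".
    for line in text.split("\n"):
        line = line.rstrip("\r")
        # Skip blank lines and comments (assumption: any "#" line is a comment).
        if not line.strip(" \t") or line.startswith("#"):
            continue
        # Everything that is left is actual data and must be plain ASCII.
        if not line.isascii():
            raise ValueError(f"non-ASCII data in cookie line: {line!r}")
        lines.append(line)
    return lines
```

A comment containing arbitrary bytes is skipped silently, while a non-ASCII byte in a data line raises `ValueError`.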

Using UTF-8 won’t really help us in this case.

vstinner · May 6, 2022, 10:28am

If the default is changed, please add an encoding parameter to let the caller select the encoding, so that a cookiejar can still be loaded with the same encoding as in Python 3.10: encoding=sys.getfilesystemencoding().

If you want to force the caller to choose, ASCII encoding would be a good default. But honestly, it’s super annoying to have to choose the encoding.

RFC and other standards are nice, but “in the wild”, people do random things which don’t respect them like HTTP Headers in UTF-8 rather than Latin-1. IMO switching cookiejar to UTF-8 by default is more backward compatible than using Latin1, since most operating systems use UTF-8 as the locale encoding (it’s mostly Windows which uses other 8-bit encodings like cp1252, no?).

malemburg · May 6, 2022, 10:42am

The point is that all actual data in a cookie file has to be ASCII according to the RFCs. It’s only comments which can break this.

Also note that the files can potentially be read and written by other applications, which may have different ideas about encoding.

Since Python’s cookiejar module is only interested in the actual data, the comments are irrelevant, so by making it read any ASCII compatible encoding and making sure only ASCII content is written, we should get the best compatibility with external tools.

methane · May 6, 2022, 11:00am

Thank you for the comments.
Now I think Latin-1 is the perfect encoding for cookiejar.

For cookie data, Latin-1 is byte-transparent between reading and writing the cookie file.
- WSGI uses Latin-1 to encode and decode HTTP headers, so bytes pass through unchanged between the HTTP request/response and the cookie file.
For reading comments, Latin-1 never raises UnicodeDecodeError.
- CookieJar just ignores comments, so there is no need to worry about mojibake.
For writing comments, CookieJar doesn’t support writing user comments to cookie.txt, so there is nothing to handle.
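The byte-transparency claim can be checked with a short sketch. The helper name `roundtrips` is made up for this example; it only tests whether all 256 byte values survive a decode/encode cycle in a given encoding.

```python
def roundtrips(encoding: str) -> bool:
    """Return True if every byte value 0..255 survives decode + encode unchanged."""
    data = bytes(range(256))
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # The codec cannot decode (or re-encode) some byte values.
        return False


print(roundtrips("latin-1"))  # True: each byte maps to the code point of the same value
print(roundtrips("utf-8"))    # False: a lone 0x80 byte is not valid UTF-8
```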

storchaka · May 6, 2022, 2:23pm

In Latin-1, bytes \x85 and \xa0 decode to whitespace characters. Code that uses str.strip() or Unicode regular expressions with \s will handle them incorrectly.
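A small sketch of this pitfall, using only the standard library. The helper name `latin1_whitespace_bytes` is hypothetical.

```python
import re


def latin1_whitespace_bytes() -> list:
    """List the byte values >= 0x80 that decode (as Latin-1) to whitespace."""
    return [b for b in range(0x80, 0x100)
            if bytes([b]).decode("latin-1").isspace()]


# Exactly the two bytes mentioned above are reported.
print([hex(b) for b in latin1_whitespace_bytes()])

# str.strip() with no arguments silently removes them from a value ...
value = b"value\xa0".decode("latin-1")
print(repr(value.strip()))

# ... and \s in a str pattern matches them as well.
print(bool(re.fullmatch(r"\s", b"\x85".decode("latin-1"))))
```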

steve.dower · May 9, 2022, 6:50pm

Rather than choosing an encoding that just happens to map all byte values, why not use a different error handler? If non-ASCII is only allowed in comments, then it won’t matter if they are skipped/replaced/escaped, provided they just don’t raise.
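A minimal sketch of the error-handler idea, assuming the file is decoded as ASCII with a lenient handler. `open_lenient` is a hypothetical helper that wraps in-memory bytes so the example runs without touching disk; with a real file the equivalent call would be `open(path, encoding="ascii", errors="replace")`.

```python
import io


def open_lenient(raw: bytes, errors: str = "replace") -> io.TextIOWrapper:
    """Wrap raw bytes as an ASCII text stream that never raises on decoding."""
    return io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", errors=errors)


raw = b"# caf\xe9\nname\tvalue\n"

# "replace" substitutes U+FFFD for each undecodable byte.
print(list(open_lenient(raw)))

# "backslashreplace" keeps the byte value visible as an escape sequence.
print(list(open_lenient(raw, errors="backslashreplace")))
```

Either handler loses the original byte in the decoded text, which is acceptable only as long as the affected lines are comments that get discarded.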

ronaldoussoren · May 10, 2022, 8:23am

Another option is to open the file in binary mode and only convert to string after stripping comments.
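The binary-mode idea could look roughly like the sketch below. `iter_cookie_lines` is a hypothetical name, and the rule that a comment is any line whose first non-blank character is `#` is an assumption made for the example.

```python
import io


def iter_cookie_lines(fileobj):
    """Yield non-comment lines from a binary file object as ASCII str.

    Hypothetical helper: comments are dropped while still bytes, so their
    encoding never matters; data lines are decoded strictly as ASCII.
    """
    for raw in fileobj:
        # bytes.strip() only removes ASCII whitespace, never \x85 or \xa0.
        stripped = raw.strip()
        if not stripped or stripped.startswith(b"#"):
            continue
        # Raises UnicodeDecodeError if a data line contains non-ASCII bytes.
        yield raw.rstrip(b"\r\n").decode("ascii")


# io.BytesIO stands in for a file opened with mode "rb".
data = io.BytesIO(b"# \xd5\xc5 comment\r\n\r\nexample.com\tFALSE\t/\tFALSE\t0\tk\tv\r\n")
print(list(iter_cookie_lines(data)))
```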

malemburg · May 11, 2022, 9:01am

Good suggestions, Steve and Ronald.

There certainly are multiple ways to achieve the same outcome: read the raw file in some way, remove the comments, process the rest as ASCII, fail if the rest is not ASCII.

methane · May 11, 2022, 9:13am

RFC 6265 says:

 cookie-value      = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
 cookie-octet      = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
                       ; US-ASCII characters excluding CTLs,
                       ; whitespace DQUOTE, comma, semicolon,
                       ; and backslash

But it is possible to write non-ASCII characters in the header.

We need to balance backward compatibility, compatibility with other tools, and security. But I am not an expert on the many HTTP tools out there.

I took a quick look at what Go does. It ignores cookie values that are not valid cookie-octets.

Maybe being strict is better for security.
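For illustration, the cookie-value grammar quoted above translates fairly directly into a regular expression. This is only a sketch of a strict check: the names `COOKIE_OCTET` and `is_valid_cookie_value` are invented here, and the code is neither Go's nor CPython's implementation.

```python
import re

# cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
COOKIE_OCTET = r"[\x21\x23-\x2B\x2D-\x3A\x3C-\x5B\x5D-\x7E]"

# cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
COOKIE_VALUE_RE = re.compile(rf'(?:{COOKIE_OCTET}*|"{COOKIE_OCTET}*")')


def is_valid_cookie_value(value: str) -> bool:
    """Return True if value matches the RFC 6265 cookie-value grammar."""
    return COOKIE_VALUE_RE.fullmatch(value) is not None


print(is_valid_cookie_value("abc123"))    # True
print(is_valid_cookie_value('"abc123"'))  # True: quoted form
print(is_valid_cookie_value("a b"))       # False: space is excluded
print(is_valid_cookie_value("caf\xe9"))   # False: non-ASCII is excluded
```

A strict reader would drop or reject any cookie whose value fails this check, while a lenient one would pass it through unchanged.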

methane · May 11, 2022, 9:16am

Opening the file in binary mode is not backward compatible. See the source code.

github.com

python/cpython/blob/main/Lib/http/cookiejar.py

r"""HTTP cookie handling for web clients.

This module has (now fairly distant) origins in Gisle Aas' Perl module
HTTP::Cookies, from the libwww-perl library.

Docstrings, comments and debug strings in this code refer to the
attributes of the HTTP cookie system as cookie-attributes, to distinguish
them clearly from Python attributes.

Class diagram (note that BSDDBCookieJar and the MSIE* classes are not
distributed with the Python standard library, but are available from
http://wwwsearch.sf.net/):

                        CookieJar____
                        /     \      \
            FileCookieJar      \      \
             /    |   \         \      \
 MozillaCookieJar | LWPCookieJar \      \
                  |               |      \
                  |   ---MSIEBase |       \


FileCookieJar.load() opens the file and passes it to self._really_load().
Subclasses implement _really_load().

Some third-party tools may implement a subclass of FileCookieJar that overrides only _really_load.
So we need to keep opening the file in text mode, not binary.
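To make the compatibility concern concrete, here is a sketch of such a subclass. `LineTypeJar` and `load_types` are hypothetical names standing in for third-party code; the example only records what type of object `_really_load()` receives from the inherited `load()`.

```python
import os
import tempfile
from http.cookiejar import FileCookieJar


class LineTypeJar(FileCookieJar):
    """Hypothetical third-party subclass that overrides only _really_load()."""

    def _really_load(self, f, filename, ignore_discard, ignore_expires):
        # Record the type of each line handed over by FileCookieJar.load().
        self.line_types = {type(line) for line in f}


def load_types(content: bytes) -> set:
    """Write content to a temporary cookie file and load it via LineTypeJar."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cookies.txt")
        with open(path, "wb") as fh:
            fh.write(content)
        jar = LineTypeJar(path)
        jar.load()  # inherited: opens the file, then calls _really_load()
        return jar.line_types


print(load_types(b"# comment\nexample.com\tFALSE\t/\tFALSE\t0\tk\tv\n"))
```

A subclass written this way expects str lines, so it would break if load() started passing a binary file object.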

ronaldoussoren · May 11, 2022, 11:09am

That’s too bad. That probably makes Steve’s suggestion of using a different error handler a better option. The file could be opened with the ascii encoding and the surrogateescape error handler. That would make it possible to parse all files and still recognise non-ASCII values, without treating some of them incorrectly as whitespace.
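A short sketch of how ascii plus surrogateescape behaves, assuming nothing beyond the standard codecs. `decode_escaped` is a hypothetical helper; with a real file the equivalent would be `open(path, encoding="ascii", errors="surrogateescape")`.

```python
def decode_escaped(raw: bytes) -> str:
    """Decode as ASCII, mapping each undecodable byte to a lone surrogate."""
    return raw.decode("ascii", errors="surrogateescape")


text = decode_escaped(b"v\x85\xa0")

# Bytes 0x85 and 0xa0 become U+DC85 and U+DCA0, which are not whitespace,
# so strip() leaves the value untouched.
print(text.strip() == text)

# Non-ASCII content is still detectable after decoding.
print(text.isascii())

# Encoding with the same handler restores the original bytes exactly.
print(text.encode("ascii", errors="surrogateescape") == b"v\x85\xa0")
```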