methane
(Inada Naoki)
May 6, 2022, 9:04am
1
Currently, http.cookiejar uses locale encoding (issue ).
I want to stop using locale encoding here.
According to RFC 6265, cookie names and values must be ASCII.
But cookie text files may contain comments that are not required to be ASCII.
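To make the failure mode concrete, here is a minimal sketch. The file contents are made up, and GBK is only a stand-in for a non-UTF-8 locale encoding (it is the codec named in the traceback quoted in this post). The helper `try_decode` is a hypothetical name used for illustration.

```python
# A Netscape-style cookie file whose only non-ASCII bytes are in a comment.
# The comment text and the cookie line are invented for this example.
raw = (
    "# Netscape HTTP Cookie File\n"
    "# 中\n"
    "example.com\tFALSE\t/\tFALSE\t4102444800\tname\tvalue\n"
).encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Return 'ok' if data decodes with the encoding, else the error type name."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__
    return "ok"


# The file was written as UTF-8, so decoding it as UTF-8 works.
print("utf-8:", try_decode(raw, "utf-8"))
# Decoding the same bytes with GBK trips over the comment line.
print("gbk:  ", try_decode(raw, "gbk"))
```

The cookie line itself is pure ASCII in both cases; only the comment decides whether loading succeeds.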
I googled and found this issue at youtube-dl.
opened 11:43AM - 08 Apr 18 UTC
closed 02:38PM - 08 Apr 18 UTC
invalid
external-bugs
```
C:\Windows\system32>youtube-dl --proxy "socks5://127.0.0.1:1080/" --cookies "D:/lynda download/cookies.txt" -v --all-subs -o "Lynda.com - %(playlist)s/%(chapter_number)s - %(chapter)s/%(playlist_index)s - %(title)s.%(ext)s" "https://www.lynda.com/Photography-tutorials/Advanced-Photography-Medium-Format-Digital-Cameras/647680-2.html"
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--proxy', 'socks5://127.0.0.1:1080/', '--cookies', 'D:/lynda download/cookies.txt', '-v', '--all-subs', '-o', 'Lynda.com - %(playlist)s/%(chapter_number)s - %(chapter)s/%(playlist_index)s - %(title)s.%(ext)s', 'https://www.lynda.com/Photography-tutorials/Advanced-Photography-Medium-Format-Digital-Cameras/647680-2.html']
c:\users\张心阳\appdata\local\programs\python\python36\lib\http\cookiejar.py:2061: UserWarning: http.cookiejar bug!
Traceback (most recent call last):
File "c:\users\张心阳\appdata\local\programs\python\python36\lib\http\cookiejar.py", line 2011, in _really_load
line = f.readline()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 663: illegal multibyte sequence
_warn_unhandled_exception()
Traceback (most recent call last):
File "c:\users\张心阳\appdata\local\programs\python\python36\lib\http\cookiejar.py", line 2011, in _really_load
line = f.readline()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 663: illegal multibyte sequence
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\张心阳\appdata\local\programs\python\python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\张心阳\appdata\local\programs\python\python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\张心阳\AppData\Local\Programs\Python\Python36\Scripts\youtube-dl.exe\__main__.py", line 9, in <module>
File "c:\users\张心阳\appdata\local\programs\python\python36\lib\site-packages\youtube_dl\__init__.py", line 471, in main
_real_main(argv)
File "c:\users\张心阳\appdata\local\programs\python\python36\lib\site-packages\youtube_dl\__init__.py", line 438, in _real_main
with YoutubeDL(ydl_opts) as ydl:
File "c:\users\张心阳\appdata\local\programs\python\python36\lib\site-packages\youtube_dl\YoutubeDL.py", line 411, in __init__
self._setup_opener()
File "c:\users\张心阳\appdata\local\programs\python\python36\lib\site-packages\youtube_dl\YoutubeDL.py", line 2291, in _setup_opener
self.cookiejar.load()
File "c:\users\张心阳\appdata\local\programs\python\python36\lib\http\cookiejar.py", line 1784, in load
self._really_load(f, filename, ignore_discard, ignore_expires)
File "c:\users\张心阳\appdata\local\programs\python\python36\lib\http\cookiejar.py", line 2063, in _really_load
(filename, line))
http.cookiejar.LoadError: invalid Netscape format cookies file 'D:/lynda download/cookies.txt': 'www.telerik.com\tFALSE\t/\tFALSE\t1679055717\tki_t\t1521288091697%3B1521288091697%3B1521289317673%3B1%3B8'
```
This problem only occurs when I'm using youtube-dl under Python 3.
Under Python 2, I can download the full course perfectly.
Meanwhile, when I run
```
D:\>python3
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys; print(sys.getdefaultencoding())
utf-8
>>>
```
as shown above, Python 3 reports UTF-8 as its default encoding.
I don’t know what tool this user used to create their cookies.txt, so I am not sure what the non-ASCII part actually was.
This issue was reported in 2018, and youtube-dl added support for UTF-8 cookie files in 2020.
committed 09:21PM - 04 May 20 UTC
+ Add support for UTF-8 in cookie files
* Skip malformed cookie file entries instead of crashing (invalid entry len, invalid expires at)
What do you think about using UTF-8 in http.cookiejar, like youtube-dl does?
malemburg
(Marc-André Lemburg)
May 6, 2022, 9:56am
2
The closest thing to a spec that I could find for the Netscape cookie file format is this curl document:
https://curl.se/docs/http-cookies.html
The purpose of the http.cookiejar module is to read and write to such a text file (in various formats), so I’d opt for reading the file as Latin-1 to avoid any encoding issues and then remove any comments. The rest must then be plain ASCII as per the RFCs or an error is raised. When writing the file, plain ASCII should be used.
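A rough sketch of that approach, not the actual http.cookiejar code: `parse_cookie_lines` and `dump_cookie_lines` are hypothetical helpers, and treating `#HttpOnly_` lines as cookie data rather than comments is an assumption based on my reading of the curl document.

```python
def parse_cookie_lines(raw: bytes) -> list[str]:
    """Decode as Latin-1, drop comments and blank lines, require ASCII for the rest."""
    data_lines = []
    # split("\n") rather than splitlines(): the latter also splits on
    # characters such as U+0085, which Latin-1 decoding can produce.
    for line in raw.decode("latin-1").split("\n"):
        line = line.rstrip("\r")
        # Assumption: "#HttpOnly_" prefixes a cookie line, so it is not a comment.
        is_comment = line.startswith("#") and not line.startswith("#HttpOnly_")
        if not line or is_comment:
            continue
        if not line.isascii():
            raise ValueError(f"non-ASCII cookie data: {line!r}")
        data_lines.append(line)
    return data_lines


def dump_cookie_lines(lines: list[str]) -> bytes:
    """Serialize cookie lines as plain ASCII; raises UnicodeEncodeError otherwise."""
    return ("\n".join(lines) + "\n").encode("ascii")


raw = (
    "# commentaire é\n"
    "example.com\tFALSE\t/\tFALSE\t4102444800\tname\tvalue\n"
).encode("utf-8")
print(parse_cookie_lines(raw))
```

Decoding as Latin-1 never fails, so the only error path left is the explicit ASCII check on the data lines.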
Using UTF-8 won’t really help us in this case.
vstinner
(Victor Stinner)
May 6, 2022, 10:28am
3
If the default is changed, please add an encoding parameter that lets the caller select the encoding, so that a cookiejar can still be loaded with the same encoding as in Python 3.10: `encoding=sys.getfilesystemencoding()`.
If you want to force the caller to choose, ASCII encoding would be a good default. But honestly, it’s super annoying to have to choose the encoding.
RFCs and other standards are nice, but “in the wild”, people do random things that don’t respect them, like sending HTTP headers in UTF-8 rather than Latin-1. IMO switching cookiejar to UTF-8 by default is more backward compatible than using Latin-1, since most operating systems use UTF-8 as the locale encoding (it’s mostly Windows which uses other 8-bit encodings like cp1252, no?).
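For illustration only, an encoding parameter could look roughly like this. `EncodingMozillaCookieJar` and its `encoding`/`errors` keywords are hypothetical and not part of http.cookiejar; the sketch also leans on the private `_really_load()` hook, and the UTF-8 default is just one of the options discussed here.

```python
import http.cookiejar
import os
import tempfile


class EncodingMozillaCookieJar(http.cookiejar.MozillaCookieJar):
    """Hypothetical subclass: lets the caller choose the text encoding on load."""

    def load(self, filename=None, ignore_discard=False, ignore_expires=False,
             *, encoding="utf-8", errors="strict"):
        if filename is None:
            if self.filename is None:
                raise ValueError("a filename is required")
            filename = self.filename
        # Same hand-off as FileCookieJar.load(), but with an explicit encoding.
        with open(filename, encoding=encoding, errors=errors) as f:
            self._really_load(f, filename, ignore_discard, ignore_expires)


# Invented file contents: a UTF-8 comment plus one ASCII cookie line.
content = (
    "# Netscape HTTP Cookie File\n"
    "# コメント\n"
    "example.com\tFALSE\t/\tFALSE\t4102444800\tname\tvalue\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cookies.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    jar = EncodingMozillaCookieJar()
    jar.load(path, encoding="utf-8")
    names = [cookie.name for cookie in jar]
print(names)
```

A caller who needs the old behaviour could then pass whatever encoding they used before instead of relying on the default.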
malemburg
(Marc-André Lemburg)
May 6, 2022, 10:42am
4
The point is that all actual data in a cookie file has to be ASCII according to the RFCs. It’s only comments which can break this.
Also note that the files can potentially be read and written by other applications, which may have different ideas about encoding.
Since Python’s cookiejar module is only interested in the actual data, the comments are irrelevant, so by making it read any ASCII compatible encoding and making sure only ASCII content is written, we should get the best compatibility with external tools.
methane
(Inada Naoki)
May 6, 2022, 11:00am
5
Thank you for the comments.
Now I think Latin-1 is the perfect encoding for cookiejar.
For cookie data, Latin-1 is byte-transparent between reading and writing the cookie file.
WSGI uses Latin-1 to encode and decode HTTP headers, so it is also byte-transparent between the HTTP request/response and the cookie file.
For reading comments, Latin-1 won’t cause a UnicodeDecodeError.
CookieJar just ignores comments, so there is no need to worry about mojibake.
For writing comments, CookieJar doesn’t support writing user comments to cookies.txt, so there is nothing to handle.
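A quick check of the byte-transparency claim (the sample comment is made up):

```python
# Every byte value maps to exactly one code point in Latin-1 and back.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")

print(len(text))
print(text.encode("latin-1") == all_bytes)

# A comment written in UTF-8 decodes to mojibake, but it never raises,
# and encoding it again restores the original bytes.
comment = "# コメント\n".encode("utf-8")
decoded = comment.decode("latin-1")
print(ascii(decoded))
print(decoded.encode("latin-1") == comment)
```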
storchaka
(Serhiy Storchaka)
May 6, 2022, 2:23pm
6
In Latin-1, bytes `\x85` and `\xa0` are whitespace. Code that uses `str.strip()` or Unicode regular expressions with `\s` handles them incorrectly.
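A small demonstration of that pitfall; the sample value is invented:

```python
import re

# Bytes 0x85 (NEL) and 0xA0 (no-break space), decoded as Latin-1.
value = b"\x85abc\xa0".decode("latin-1")

print([ch.isspace() for ch in ("\x85", "\xa0")])
# str.strip() with no argument silently removes both characters.
print(ascii(value.strip()))
# A Unicode-mode \s matches them; re.ASCII restricts \s to ASCII whitespace.
print(re.fullmatch(r"\s+", "\x85\xa0") is not None)
print(re.fullmatch(r"\s+", "\x85\xa0", flags=re.ASCII) is not None)
# Stripping an explicit ASCII set leaves them alone.
print(ascii(value.strip(" \t\r\n")))
```

So any parsing code that runs on Latin-1-decoded text would need to strip an explicit ASCII set, or use `re.ASCII`, to avoid changing cookie data.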
Rather than choosing an encoding that just happens to map all byte values, why not use a different error handler? If non-ASCII is only allowed in comments, then it won’t matter if they are skipped/replaced/escaped, provided they just don’t raise.
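For comparison, this is what the standard error handlers do with a non-ASCII comment when decoding as ASCII (the comment text is made up):

```python
raw = "# café\n".encode("utf-8")  # the é becomes the two bytes 0xC3 0xA9

results = {}
for handler in ("ignore", "replace", "backslashreplace", "surrogateescape"):
    results[handler] = raw.decode("ascii", errors=handler)
    # ascii() keeps the output printable even when it contains lone surrogates.
    print(f"{handler:17}", ascii(results[handler]))

# The default handler is the only one that raises.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print("strict           ", type(exc).__name__)
```

Of these, only `surrogateescape` and `backslashreplace` keep enough information to recover the original bytes.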
3 Likes
Another option is to open the file in binary mode and only convert to string after stripping comments.
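Sketched on bytes, that could look like the following. `read_cookie_lines` is a hypothetical helper, and keeping `#HttpOnly_` lines as data is an assumption about the file format rather than something stated in this thread.

```python
import io
from typing import BinaryIO


def read_cookie_lines(f: BinaryIO) -> list[str]:
    """Skip comment lines while still in bytes, then decode the rest as ASCII."""
    lines = []
    for raw_line in f:
        raw_line = raw_line.rstrip(b"\r\n")
        is_comment = raw_line.startswith(b"#") and not raw_line.startswith(b"#HttpOnly_")
        if not raw_line or is_comment:
            continue
        # Strict ASCII: non-ASCII cookie data raises UnicodeDecodeError here.
        lines.append(raw_line.decode("ascii"))
    return lines


sample = io.BytesIO(b"# caf\xc3\xa9\n\nexample.com\tdata\r\n")
print(read_cookie_lines(sample))
```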
5 Likes
malemburg
(Marc-André Lemburg)
May 11, 2022, 9:01am
9
Good suggestions, Steve and Ronald.
There certainly are multiple ways to achieve the same outcome: read the raw file in some way, remove the comments, process the rest as ASCII, fail if the rest is not ASCII.
methane
(Inada Naoki)
May 11, 2022, 9:13am
10
Steve Dower:
Rather than choosing an encoding that just happens to map all byte values, why not use a different error handler? If non-ASCII is only allowed in comments, then it won’t matter if they are skipped/replaced/escaped, provided they just don’t raise.
RFC 6265 says:
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
But it is possible to write non-ASCII characters in the header.
We need to balance backward compatibility, compatibility with other tools, and security. But I am not an expert on the many HTTP tools out there.
I took a quick look at what Go does. It ignores cookie values that are not valid cookie-octets.
Maybe being strict is better for security.
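As a sketch of what a strict check could look like, the quoted grammar can be transcribed into a regular expression. `is_valid_cookie_value` and `keep_valid` are hypothetical names, and dropping invalid values silently is only one possible policy.

```python
import re

# Transcribed from the RFC 6265 grammar quoted above:
#   cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
_COOKIE_OCTETS = r"[\x21\x23-\x2B\x2D-\x3A\x3C-\x5B\x5D-\x7E]*"
COOKIE_VALUE_RE = re.compile(rf'(?:{_COOKIE_OCTETS}|"{_COOKIE_OCTETS}")')


def is_valid_cookie_value(value: str) -> bool:
    """True if value matches the cookie-value production."""
    return COOKIE_VALUE_RE.fullmatch(value) is not None


def keep_valid(cookies: dict[str, str]) -> dict[str, str]:
    """Drop cookies whose value is not a valid cookie-value (a strict policy)."""
    return {name: value for name, value in cookies.items()
            if is_valid_cookie_value(value)}


for candidate in ("abc123", '"quoted"', "has space", "caf\xe9", "a;b"):
    print(ascii(candidate), is_valid_cookie_value(candidate))
```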
methane
(Inada Naoki)
May 11, 2022, 9:16am
11
It is not backward compatible. See the source code.
r"""HTTP cookie handling for web clients.
This module has (now fairly distant) origins in Gisle Aas' Perl module
HTTP::Cookies, from the libwww-perl library.
Docstrings, comments and debug strings in this code refer to the
attributes of the HTTP cookie system as cookie-attributes, to distinguish
them clearly from Python attributes.
Class diagram (note that BSDDBCookieJar and the MSIE* classes are not
distributed with the Python standard library, but are available from
http://wwwsearch.sf.net/):
                        CookieJar____
                        /     \      \
            FileCookieJar      \      \
             /    |   \         \      \
 MozillaCookieJar | LWPCookieJar \      \
                  |               |      \
                  |   ---MSIEBase |       \
`FileCookieJar.load()` opens the file and passes it to `self._really_load()`.
Subclasses implement `_really_load()`.
Some third-party tools may implement a subclass of FileCookieJar that overrides only `_really_load`.
So we need to open the file in text mode, not binary.
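To illustrate the compatibility concern, here is a made-up subclass of the kind described. `ThirdPartyCookieJar` and its parsing logic are hypothetical; the point is only that code written against str lines fails when handed a binary file object.

```python
import http.cookiejar
import io


class ThirdPartyCookieJar(http.cookiejar.FileCookieJar):
    """Hypothetical third-party subclass that overrides only _really_load()."""

    def _really_load(self, f, filename, ignore_discard, ignore_expires):
        self.data_lines = []
        for line in f:
            # Written against str lines, as load() supplies them today.
            if line.startswith("#") or not line.strip():
                continue
            self.data_lines.append(line.rstrip("\n"))


jar = ThirdPartyCookieJar()

# A text-mode file object works as the subclass expects.
jar._really_load(io.StringIO("# comment\nexample.com\tdata\n"), "cookies.txt", False, False)
text_result = list(jar.data_lines)
print(text_result)

# A binary file object breaks it: bytes.startswith() rejects a str argument.
binary_failed = False
try:
    jar._really_load(io.BytesIO(b"# comment\nexample.com\tdata\n"), "cookies.txt", False, False)
except TypeError as exc:
    binary_failed = True
    print("TypeError:", exc)
```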
That’s too bad. That probably makes Steve’s suggestion of using a different error handler the better option: the file could be opened with the `ascii` encoding and the `surrogateescape` error handler. That makes it possible to parse all files and still recognise non-ASCII values, without treating some of them incorrectly as whitespace.
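A minimal sketch of that combination, using an in-memory file so it is self-contained. The sample bytes are invented, and `io.TextIOWrapper` stands in for `open(path, encoding="ascii", errors="surrogateescape")`.

```python
import io

raw = (
    b"# comment with \x85 and \xa0\n"
    b"example.com\tFALSE\t/\tFALSE\t4102444800\tname\tbad\xa0\n"
)

# Text-mode view of the bytes with ASCII decoding and surrogateescape.
f = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", errors="surrogateescape")

values = []
for line in f:
    line = line.rstrip("\n")
    if line.startswith("#"):
        continue  # comments may contain any bytes; decoding them never raises
    values.append(line.split("\t")[-1])

for value in values:
    # The escaped byte is a lone surrogate, not whitespace: strip() keeps it,
    # and isascii() still flags the value as non-ASCII.
    print(ascii(value), value == value.strip(), value.isascii())
```

The original byte can still be recovered with `value.encode("ascii", "surrogateescape")` if a caller needs it.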