Drop supporting bytes on `sys.path`?

brettcannon · July 8, 2022, 10:29pm

bytes do not work on sys.path · Issue #91181 · python/cpython · GitHub is tracking the fact that bytes on sys.path accidentally got broken way back when. Since we have it documented as supported, I figured we should add it back. There’s a PR by @graingert at gh-91181: restore support for bytes on sys.path in FileFinder by graingert · Pull Request #31897 · python/cpython · GitHub to add it back in.

Unfortunately, the test suite is run with -bb. That causes BytesWarning to be raised as an exception when looking into sys.path_importer_cache with a mix of bytes and strings. One solution is to stop using -bb so that this doesn’t cause a failure, but it’s a legitimate check and catching the exception doesn’t make the `in` check work anyway. Another is to temporarily turn off warnings for BytesWarning in importlib in the key places where sys.path_importer_cache is checked, but that doesn’t fix the issue for anyone trying to use sys.path_importer_cache themselves.
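To see the -bb behavior in isolation, here is a minimal sketch. It is not the test suite's code: `SNIPPET` and `run_snippet` are hypothetical names, and the snippet only compares a str with a bytes object, which is the kind of comparison a mixed-key lookup can end up performing.

```python
import subprocess
import sys

# Hypothetical snippet: compare a str with a bytes object of the same content.
SNIPPET = "a = 'x'; b = b'x'; print(a == b)"


def run_snippet(*flags):
    """Run SNIPPET in a fresh interpreter started with the given flags."""
    return subprocess.run(
        [sys.executable, *flags, "-c", SNIPPET],
        capture_output=True,
        text=True,
    )


plain = run_snippet()         # no flag: the comparison is silently False
strict = run_snippet("-bb")   # -bb: the comparison raises BytesWarning

print(plain.returncode, plain.stdout.strip())
print(strict.returncode, "BytesWarning" in strict.stderr)
```

Without the flag the comparison just evaluates to False; with -bb the same line aborts the child interpreter, which is why a single bytes entry is enough to break a test run.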

The last option is to update the docs to say sys.path must be strings (or at least that the built-in import system only supports strings). Since this support broke sometime between Python 3.2 and 3.6, it hasn’t worked for quite some time, and that wasn’t reported until March 2022. That would suggest it isn’t really missed.

I originally wanted to bring back the bytes support, but in writing this topic I realized it’s going to be messy for users to support directly, and so I’m now advocating dropping bytes support from the import system. Does anyone object to that?

gpshead · July 8, 2022, 11:09pm

Agreed. If it hasn’t worked since 3.6 and we haven’t heard anyone complain, let’s just keep the simpler behavior and update the docs.

I’d only consider re-adding sys.path bytes support if something really painful from a user comes up during a pre-release related to our intention to make utf-8 the default (PEP 686 – Make UTF-8 mode default | peps.python.org targeting 3.15) for filesystem and io encodings.

vstinner · July 9, 2022, 1:51pm

In Python 3.0, using bytes was the only option for paths which cannot be decoded from the Python filesystem encoding. Since Python 3.1 and PEP 383 (surrogateescape), using Unicode for paths gives access to all paths, so supporting bytes paths is no longer needed. Unicode is better for portability: on Windows, many paths are not encodable to the ANSI code page. Well, since Python 3.6, Python uses UTF-8 rather than the ANSI code page for paths on Windows, but still, Unicode remains the least surprising and most convenient choice.

See my articles about Unicode and paths in Python:

For example, os.environb no longer makes sense and should be deprecated/removed.

The sys.argv documentation explains how to retrieve the original bytes argv: sys — System-specific parameters and functions — Python 3.12.0a0 documentation
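The recipe there amounts to passing each argument through os.fsencode(). A minimal sketch, where `argv_as_bytes` is a hypothetical helper name and not something the stdlib provides:

```python
import os
import sys


def argv_as_bytes(argv=None):
    """Return the arguments re-encoded to bytes with the filesystem encoding.

    Defaults to sys.argv; a list can be passed in for experimentation.
    """
    if argv is None:
        argv = sys.argv
    # os.fsencode() reverses the decoding Python applied to the arguments at startup.
    return [os.fsencode(arg) for arg in argv]


print(argv_as_bytes(["prog", "data.txt"]))
```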

vstinner · July 9, 2022, 1:52pm

I support that in general, everywhere in the stdlib. Using bytes for paths and filenames is bad and causes a lot of issues.

steven.daprano · July 10, 2022, 3:10am

If we remove support for bytes filenames, whether throughout the stdlib or just in the import system, can we please provide a recipe or FAQ for how to represent a byte path which is unrepresentable in the system encoding?

Consider an invalid UTF-8 path component like b'\xe5\xe6' on a Linux file system. There are many ways such a file or directory could be created. How do I refer to that?

guido · July 10, 2022, 4:49am

There’s a range of Unicode code points reserved for that. Demonstration:

>>> import os
>>> open(b'\xe5\xe6', 'wb').close()
>>> os.listdir('.')
['\udce5\udce6']
>>>
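The same mapping can be reproduced without touching the file system. This sketch assumes the UTF-8 codec with the surrogateescape error handler from PEP 383 is spelled out explicitly, so it does not depend on the platform's filesystem encoding:

```python
raw = b'\xe5\xe6'  # not valid UTF-8

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
name = raw.decode('utf-8', errors='surrogateescape')
print(ascii(name))

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode('utf-8', errors='surrogateescape')
print(round_tripped == raw)
```

The str form can be used anywhere a path is expected, and the original bytes are never lost.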

steven.daprano · July 10, 2022, 6:11am

That’s great, thanks. I’m suggesting that if we do remove support for bytes paths, we document this trick somewhere easily visible in the docs.

pf_moore · July 10, 2022, 11:04am

It’s covered by PEP 383 – Non-decodable Bytes in System Character Interfaces | peps.python.org and the docs on the “surrogateescape” codec error handler, I believe.

vstinner · August 1, 2022, 12:23pm

There are many places in the Python documentation explaining how undecodable bytes are handled:

Obviously, there is always room for enhancements.

vstinner · August 1, 2022, 12:25pm

I suggest still accepting bytes for “low-level” OS operations like open() or os.listdir().

IMO for sys.path, consistency matters more: it’s too surprising to get mostly str strings but sometimes a bytes string. So disallow bytes for filenames in the import system.
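For code that currently builds path entries as bytes, converting them before they reach sys.path is a one-liner. A minimal sketch, where `str_path_entries` is a hypothetical helper and the example entries are made up:

```python
import os


def str_path_entries(entries):
    """Return entries with every bytes or os.PathLike item converted to str."""
    # os.fsdecode() leaves str untouched and decodes bytes with the
    # filesystem encoding and its error handler.
    return [os.fsdecode(entry) for entry in entries]


mixed = ["/usr/lib/example", b"/opt/example"]  # hypothetical entries
print(str_path_entries(mixed))
```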