Which encodings do I really need on Windows?

bluebird75 · May 2, 2020, 9:43am

Hi guys,

I am currently packaging a PyQt application on Windows using PyInstaller. Given the large size of the generated directory (86 MB), I am trying to trim as much as possible from it.

One area where I suspect I am bundling too much is the encodings package. My application is pretty straightforward in terms of encoding usage:

I use a Qt filesystem dialog to get the name of the file to open
I use Python open/read/write to read and write my data files. They consist exclusively of ASCII data, but I’ll probably open them as UTF-8 to be on the safe side
it is possible to pass the name of the file to open on the command-line
the application is window based but in debug mode, I package it in console mode and print on the stdout.

My understanding is that I need the following encodings :

UTF8 for my file reading / writing
Qt string to Python string conversion probably uses an internal Python mechanism, not sure about that
file system encoding: if I read the documentation correctly, I only need mbcs and utf8 here. But does that really work for a Japanese user with Japanese file names?
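A minimal sketch of the file reading and writing described above, assuming the data files are plain text. The helper names save_text and load_text are hypothetical; the point is only that passing encoding="utf-8" explicitly keeps the result independent of the user's code page.

```python
from pathlib import Path


def save_text(path, text):
    """Write text as UTF-8, regardless of the machine's default encoding."""
    Path(path).write_text(text, encoding="utf-8")


def load_text(path):
    """Read a UTF-8 text file back into a str."""
    return Path(path).read_text(encoding="utf-8")
```

Pure ASCII data round-trips unchanged through these helpers, since ASCII is a subset of UTF-8.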

The ones I am not clear about are:

command-line: what encoding is used here?
stdout: it uses the default code page from what I understand, meaning that this is totally out of my control. But that’s only for debugging purposes, so I can perfectly well skip that one.

So, can I limit myself to mbcs, utf8 and ascii? My application works fine on my computer if I do this, but that’s hardly a sign that it would work everywhere…
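One way to look at this on a given machine is to ask Python which encodings it picked implicitly and whether the matching codecs can still be found. This is only a sketch: the function names are hypothetical, and a clean result on one machine says nothing about machines configured with a different code page.

```python
import codecs
import locale
import sys


def implicit_encodings():
    """Return the encodings Python selected on its own for this process."""
    found = {
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
    }
    for name in ("stdin", "stdout", "stderr"):
        # In a windowed build the standard streams can be None.
        stream = getattr(sys, name, None)
        found[name] = getattr(stream, "encoding", None)
    return found


def missing_codecs(names):
    """Return the encoding names that codecs.lookup() cannot resolve."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    in_use = implicit_encodings()
    print(in_use)
    print("missing:", missing_codecs([e for e in in_use.values() if e]))
```

Running this inside the packaged application, rather than in the development interpreter, is what shows whether a stripped bundle still has the codecs it needs.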

steve.dower · May 3, 2020, 8:13am

If you’re on Python 3.8, then that’s probably right. Once you control the standard streams, I don’t think there’s anywhere that implicitly uses any other encodings without giving you the option to override it (though the same cannot be said for libraries you may be using).

Have you calculated how much space you’ll save on encodings, though? I have checked it out before and didn’t think it was worthwhile. Python itself compresses down to about 7-8 MB in the embeddable distro, so if you’re at 10x that it’s probably because of your other dependencies.
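A rough way to estimate the potential saving is to add up the size of the encodings package on disk. The sketch below counts only the .py sources of a regular installation, which is an assumption: a frozen bundle usually stores compiled and possibly compressed bytecode, so the real figure will differ.

```python
import encodings
import os


def package_source_size(package):
    """Sum the size in bytes of all .py files under a package's directories."""
    total = 0
    for root in package.__path__:
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in filenames:
                if filename.endswith(".py"):
                    total += os.path.getsize(os.path.join(dirpath, filename))
    return total


if __name__ == "__main__":
    size = package_source_size(encodings)
    print(f"encodings sources: {size / 1024:.0f} KiB")
```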

bluebird75 · May 3, 2020, 10:36am

The two areas where I am not very knowledgeable are really the filesystem encoding and the command line. For the filesystem, the doc confirms that mbcs + utf8 is sufficient. For command-line arguments, I am still not sure what is used.

Trimming encodings saved me 384 KB. After removing several other things, I ended up with a 10 MB executable. So that’s around 4%: neither negligible nor a major saving. I wrote an article in French 2 years ago about all the steps I took to move from 86 MB to 10 MB during packaging. The DLLs and Qt in particular are the big culprits (see https://linuxfr.org/users/bluebird/journaux/reduire-la-taille-des-executables-generes-avec-pyinstaller#comment-1751792 for the curious).

By the way, I watched your presentation “Python on Windows is actually OK” on YouTube. You are totally spot on, thank you for sharing those tips and making the actual situation more visible. I have been a Windows/Python developer for years! Now that we can have Windows support even on Travis, things will become easier and easier for people like me!

uranusjr · May 3, 2020, 12:41pm

I believe Windows takes Unicode for command arguments (CreateProcess), so encoding is not an issue there, it’s all text. Things are muddy for input and output, but mbcs is still your best guess.

steve.dower · May 4, 2020, 7:53am

Not quite. If not started with the console (or with one of the legacy flags) it’ll try and detect the code page and select the matching encoding. mbcs is the explicit alias to get this behaviour, but it’s not what the console uses (I don’t know why), even though it is what the (legacy) filesystem encoding used. I don’t remember what the fallback behaviour is.

Probably just need to do some “chcp” changes and test. It’d be a real shame to break on other people’s machines if they’re not using the same code page.
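A sketch of what such a test could check after each “chcp” change: read the active code pages through the Win32 API and see whether Python can find a codec for each. The "cp<number>" naming and the special case for 65001 (UTF-8) follow Python's usual codec names, but the helper names are hypothetical and the mapping is an assumption that may not cover every code page. Off Windows, active_code_pages() returns an empty dict.

```python
import codecs
import sys


def codec_for_code_page(code_page):
    """Map a Windows code page number to the usual Python codec name."""
    if code_page == 65001:  # Windows code page number for UTF-8
        return "utf-8"
    return f"cp{code_page}"


def codec_available(name):
    """Return True if codecs.lookup() can resolve the given encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def active_code_pages():
    """Return the ANSI, OEM and console output code pages (Windows only)."""
    if sys.platform != "win32":
        return {}
    import ctypes

    kernel32 = ctypes.windll.kernel32
    pages = {
        "ansi": kernel32.GetACP(),
        "oem": kernel32.GetOEMCP(),
        # Returns 0 when the process has no console attached.
        "console_output": kernel32.GetConsoleOutputCP(),
    }
    return {key: value for key, value in pages.items() if value}


if __name__ == "__main__":
    for label, page in active_code_pages().items():
        name = codec_for_code_page(page)
        print(label, page, name, "ok" if codec_available(name) else "MISSING")
```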

uranusjr · May 4, 2020, 12:09pm

Thanks for the clarification. I was trying to say there is no way to easily tell what the best encoding is, so going with mbcs is the best option if the goal is to strip encodings, since you really can’t strip it down very much otherwise. (And there is always ctypes if you really need code page encoding.)

steve.dower · May 4, 2020, 1:09pm

And I was trying to say that CPython won’t [correction - may not] even run if the encoding for the current code page is missing and it needs it for stdin/out. OP already said he’ll use UTF-8 everywhere possible, which is fine, but you need the runtime to start to get that far.

uranusjr · May 4, 2020, 1:31pm

I see, then I guess there’s not much doable here, basically all cpXXX encodings need to be kept if I’m not mistaken.

steve.dower · May 4, 2020, 2:13pm

Either that or carefully making sure you’ve overridden sys.std*, which may not be possible unless it was added into the newer initialization structures. I’d need to reread the sources.
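For reference, here is a sketch of overriding the standard streams from within the program, assuming Python 3.7 or later, where text streams have a reconfigure() method. The helper name force_utf8 is hypothetical. This code only runs once the interpreter has already started, so it does not settle whether startup itself succeeds when the code page codec is missing.

```python
import io
import sys


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure().

    Returns True if the stream was reconfigured, False otherwise
    (for example when the stream is None or is not a TextIOWrapper).
    """
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        return False
    reconfigure(encoding="utf-8", errors="backslashreplace")
    return True


if __name__ == "__main__":
    # Do this as early as possible, before anything is printed.
    for name in ("stdout", "stderr"):
        force_utf8(getattr(sys, name, None))
```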

bluebird75 · May 9, 2020, 2:10pm

Thanks for the clarification.

From the few experiments I ran, it is possible to pass a character that is not part of the current code page as a command-line argument, and have Python process it correctly. If I don’t force the encoding when writing this character to a file, I get a UnicodeEncodeError as I should. If I force UTF-8, the correct characters are written to the file.

So command-line argument parsing does not depend on the codepage. I seem to remember that it is linked to environment variable encoding. I’ll research that further…
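The experiment can be approximated as follows. This is a sketch, not the exact script used: the helper try_write is hypothetical, and cp1252 stands in for “the current code page” so that the behaviour can be reproduced on any machine.

```python
import os
import sys
import tempfile


def try_write(text, encoding):
    """Write text to a temporary file with the given encoding.

    Returns the bytes that ended up in the file, or None if the text
    cannot be represented in that encoding.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as handle:
            handle.write(text)
        with open(path, "rb") as handle:
            return handle.read()
    except UnicodeEncodeError:
        return None
    finally:
        os.remove(path)


if __name__ == "__main__":
    # The argument arrives as a str; fall back to a sample Japanese string.
    argument = sys.argv[1] if len(sys.argv) > 1 else "日本語"
    print("cp1252:", try_write(argument, "cp1252"))
    print("utf-8: ", try_write(argument, "utf-8"))
```

With an argument such as Japanese text, the cp1252 attempt returns None because the characters cannot be encoded, while the UTF-8 attempt returns the encoded bytes.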

steve.dower · May 9, 2020, 6:16pm

All of those will use the Unicode APIs, so they’ll be fine.

It’s just standard input/output when redirected to files that could be an issue.