Python has a few helpful i18n tools in the Tools/i18n directory, one of which is pygettext, which extracts translatable strings from source files. This tool is also mentioned in the gettext documentation.
Unfortunately, nowadays the feature set of pygettext is severely lacking compared to other widely used tools such as pybabel and GNU’s xgettext.
Namely, it lacks support for ngettext, pgettext and similar functions, for extracting programmer comments, for message flags, for custom marking keywords with multiple arguments, and more. This makes the tool unusable in many projects which rely on this functionality.
There has been some talk about this in the last few years:
pygettext itself says:
“Python equivalent of xgettext(1)”
pygettext attempts to be option and feature compatible with GNU
xgettext where ever possible.
Given the state of the tool and what it aims to achieve, I propose to modernize it in order to be more in line with other tools that are used nowadays.
This would (could) entail:
Expanding the default keyword support to ngettext, pgettext and others
Allowing custom marking keywords with multiple arguments (using funcname:1c,2,3 syntax)
Supporting python-format and python-brace-format flags for better checks of msgstr
Better text wrapping support
Using argparse instead of getopt for better argument checking & error messages
Using the AST parser instead of the current tokenizer-based approach - this will greatly simplify the code and make it more robust
Updating and adding tests
Adding more CLI options to better align with the functionality of pybabel and xgettext
Documenting the CLI options
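To make the funcname:1c,2,3 syntax from the list above concrete, here is a minimal sketch of how such a keyword spec could be parsed. It is not pygettext's code: the function name parse_keyword_spec and the role labels are hypothetical, and it only assumes the xgettext convention that a number with a c suffix marks the context argument, the first plain number the msgid, and the second the plural form.

```python
def parse_keyword_spec(spec):
    """Parse an xgettext-style keyword spec such as 'pgettext:1c,2'.

    Returns (function_name, {argument_position: role}); positions are 1-based.
    Hypothetical helper for illustration only.
    """
    name, sep, args = spec.partition(":")
    if not name:
        raise ValueError(f"missing function name in {spec!r}")
    if not sep:
        # A bare name means "the first argument is the msgid".
        return name, {1: "msgid"}

    roles = {}
    plain_roles = iter(("msgid", "msgid_plural"))
    for part in args.split(","):
        part = part.strip()
        if part.endswith("c"):
            # A trailing 'c' marks the context (msgctxt) argument.
            position, role = part[:-1], "msgctxt"
        else:
            position = part
            role = next(plain_roles, None)
            if role is None:
                raise ValueError(f"too many plain arguments in {spec!r}")
        if not position.isdigit() or int(position) < 1:
            raise ValueError(f"invalid argument position in {spec!r}")
        roles[int(position)] = role
    return name, roles


print(parse_keyword_spec("pgettext:1c,2"))
print(parse_keyword_spec("npgettext:1c,2,3"))
```

The sketch ignores other parts of the xgettext syntax (for example the t suffix for a total argument count), which a real implementation would have to decide on.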
Of course, the list is by no means complete and not all of the points are necessary in order to improve pygettext. I deal with i18n in my job so having an up-to-date tool in Python itself would be great in any case.
One more point which was raised here is making the i18n scripts easier to access. One option was making gettext a module and accessing the scripts via python -m analogous to json.tool.
I was wondering what the community thinks about this proposal. I’d be happy to take the lead on this if we agree that it is worthwhile.
Had a quick look at the code and it looks pretty similar to a small POC I hacked together over the weekend. Definitely a good starting point!
If we end up going through with this, the best would probably be starting with the AST parser and updating/expanding the tests at the same time. After that, it’ll be much easier to add any missing features.
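For anyone wondering what the AST approach looks like, here is a rough sketch (not the POC mentioned above and not pygettext's code). The KEYWORDS set and the extract_msgids name are made up for illustration, and it only handles the simplest case where the first argument is a string literal.

```python
import ast

# Hypothetical default keywords for this sketch only.
KEYWORDS = {"_", "gettext"}


def extract_msgids(source):
    """Return sorted (lineno, msgid) pairs for calls like _("text")."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or not node.args:
            continue
        func = node.func
        if isinstance(func, ast.Name):           # _("...")
            name = func.id
        elif isinstance(func, ast.Attribute):    # gettext.gettext("...")
            name = func.attr
        else:
            continue
        first = node.args[0]
        # Only plain string literals are extractable; variables and
        # f-strings are skipped.
        if (name in KEYWORDS
                and isinstance(first, ast.Constant)
                and isinstance(first.value, str)):
            found.append((node.lineno, first.value))
    return sorted(found)


source = (
    'print(_("Hello"))\n'
    'x = gettext.gettext("Bye " "now")\n'
    'y = _(name)\n'
)
print(extract_msgids(source))
```

Note that the parser has already joined the implicitly concatenated literals on the second line into a single string, which is one of the things a tokenizer-based extractor has to handle by hand.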
Is there a reason to keep pygettext around? I’ve been out of the i18n game for a long while so I don’t know what the current state in and out of Python is. For historical context, I wrote the original pygettext because xgettext didn’t support Python, and I needed it to add translations to GNU Mailman. It’s been practically unmaintained for years, and I have no personal interest in working on it any more.
Maybe it’s time we removed all of Tools/i18n from the CPython source [1]? If others are interested in maintaining any of the scripts in that directory, we could split them out into their own repos, and release them separately.
In its current state, probably not. However, I’d argue that message extraction is an important part of i18n, which Python already supports through the gettext module. Thus, having pygettext in the toolbox is in the spirit of ‘batteries included’, as it would allow one to localize applications without the need for any additional libraries.
It’s also true that xgettext now supports Python, but xgettext is not available on every platform (like Windows) so that’s not always a viable alternative.
pygettext (and msgfmt) is also recommended in the documentation, but I don’t know if that’s reason enough in and of itself…
Overall though I think that the effort required to improve pygettext is small enough to justify keeping it around.
Just adding weight to either update pygettext or update the documentation to clearly point out the limitations. I’ve scratched my head for a good 30 minutes trying to understand why my pgettext calls were not extracted by pygettext.
The possibility that the tool simply did not support the feature was not an option I even considered.
While xgettext supports Python now, it needs --keyword=pgettext:1c,2 to extract pgettext calls.
In terms of compatibility, PO files are a well-defined format and the goal is for pygettext to produce output that is compatible with both xgettext and babel (which is also the case currently). Since I’m more familiar with babel, I was actually looking into reusing babel’s tests instead, but the bigger priority now is fixing the many bugs in pygettext.
When it comes to feature parity, the biggest discrepancy is in the CLI options which are much more limited in pygettext. For instance, there is no way to define custom keywords which is something I’d like to add. In fact, it is only relatively recently that pygettext can extract ngettext/pgettext/etc… at all.
For extraction itself, once pygettext switches to a parser, we’ll be able to fix, in theory, all extraction-related bugs, some of which are present even in xgettext/babel.