Modernize and add missing features to pygettext

tomasr8 · May 3, 2023, 3:43pm

Python has a few helpful i18n tools in the Tools/i18n directory, one of which is pygettext, which extracts translatable strings from source files. This tool is also mentioned in the gettext documentation.

Unfortunately, nowadays the feature set of pygettext is severely lacking compared to other widely used tools such as pybabel and GNU’s xgettext.

Namely, it lacks support for ngettext, pgettext, etc., for extracting programmer comments, for message flags, for custom marking keywords with multiple arguments, and more. This makes the tool unusable in many projects which rely on this functionality.

There has been some talk about this in the last few years:

github.com/python/cpython

Make il8n tools available from `python -m`

opened 04:23PM - 07 May 19 UTC

BPO: [36837](https://bugs.python.org/issue36837) · Nosy: @warsaw, @ab…adger, @bbkane · Status: open (as of the migration from bugs.python.org)

pygettext itself says:

“Python equivalent of xgettext(1)”

pygettext attempts to be option and feature compatible with GNU
xgettext where ever possible.

Given the state of the tool and what it aims to achieve, I propose to modernize it in order to be more in line with other tools that are used nowadays.

This would (could) entail:

Expanding the default keyword support to ngettext, pgettext and others
Allowing custom marking keywords with multiple arguments (using funcname:1c,2,3 syntax)
Extracting programmer comments (there’s an old PR that implements it: python/cpython issue #42361, “pygettext: extract translators comments”)
Supporting python-format and python-brace-format flags for better checks of msgstr
Better text wrapping support
Using argparse instead of getopt for better argument checking & error messages
Using the AST parser instead of the current tokenizer-based approach - this will greatly simplify the code and make it more robust
Updating and adding tests
Adding more CLI options to better align with the functionality of pybabel and xgettext
Documenting the CLI options
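
To give a rough idea of what the AST-based approach could look like, here is a minimal sketch. It is not the actual pygettext or redgettext code; the `KEYWORDS` table and the `extract_messages` name are made up for illustration, and comments, flags and error reporting are left out entirely:

```python
import ast

# Hypothetical keyword table: function name -> 1-based argument positions.
# "msgctxt" marks the context argument (what xgettext writes as `1c`).
KEYWORDS = {
    "_": {"msgid": 1},
    "gettext": {"msgid": 1},
    "ngettext": {"msgid": 1, "msgid_plural": 2},
    "pgettext": {"msgctxt": 1, "msgid": 2},
    "npgettext": {"msgctxt": 1, "msgid": 2, "msgid_plural": 3},
}


def _call_name(node):
    """Return the called name for `f(...)` or `obj.f(...)`, else None."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return None


def extract_messages(source, keywords=KEYWORDS):
    """Yield (lineno, message) for calls whose marked args are string literals."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        spec = keywords.get(_call_name(node))
        if spec is None:
            continue
        message = {}
        for field, position in spec.items():
            if position > len(node.args):
                break  # too few positional arguments, skip this call
            arg = node.args[position - 1]
            if not (isinstance(arg, ast.Constant) and isinstance(arg.value, str)):
                break  # not a plain string literal (variable, f-string, ...)
            message[field] = arg.value
        else:
            yield node.lineno, message
```

Note that `ast.walk` does not visit nodes in source order, so a caller would sort the results, e.g. `sorted(extract_messages(src), key=lambda item: item[0])`. One nice side effect of using the parser is that adjacent string literals such as `_("a" "b")` already arrive as a single constant.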

Of course, the list is by no means complete and not all of the points are necessary in order to improve pygettext. I deal with i18n in my job so having an up-to-date tool in Python itself would be great in any case.

One more point which was raised here is making the i18n scripts easier to access. One option was making gettext a module and accessing the scripts via python -m analogous to json.tool.

I was wondering what the community thinks about this proposal. I’d be happy to take the lead on this if we agree that it is worthwhile.

jack1142 · May 3, 2023, 5:30pm

I could bring these over from Cog-Creators/redgettext on GitHub, a slightly modified version of pygettext for Red-DiscordBot (https://github.com/Cog-Creators/Red-DiscordBot); see also the open PR. That repo is under the GPL, but I’m the author of the AST rewrite, the translator comments handling, and the support for more keywords, so I’m free to license them under CPython’s license. It will need slight adjustments for sure, but it would be a good start.

tomasr8 · May 3, 2023, 7:19pm

Had a quick look at the code and it looks pretty similar to a small POC I hacked together over the weekend. Definitely a good starting point!

If we end up going through with this, the best approach would probably be to start with the AST parser and update/expand the tests at the same time. After that, it’ll be much easier to add any missing features.

barry · May 4, 2023, 3:12pm

Is there a reason to keep pygettext around? I’ve been out of the i18n game for a long while so I don’t know what the current state in and out of Python is. For historical context, I wrote the original pygettext because xgettext didn’t support Python, and I needed it to add translations to GNU Mailman. It’s been practically unmaintained for years, and I have no personal interest in working on it any more.

Maybe it’s time we removed all of Tools/i18n from the CPython source (though I’d propose waiting until 3.13)? If others are interested in maintaining any of the scripts in that directory, we could split them out into their own repos and release them separately.

tomasr8 · May 4, 2023, 5:34pm

In its current state, probably not. However, I’d argue that message extraction is an important part of i18n, which Python already supports through the gettext module. Thus, having pygettext in the toolbox is in the spirit of ‘batteries included’, as it would allow one to localize applications without any additional libraries.

It’s also true that xgettext now supports Python, but xgettext is not available on every platform (like Windows) so that’s not always a viable alternative.

pygettext (and msgfmt) is also recommended in the documentation, but I don’t know if that’s reason enough in and of itself…

Overall though I think that the effort required to improve pygettext is small enough to justify keeping it around.

barry · May 4, 2023, 8:58pm

If so, perhaps others will provide PRs and reviews to keep it viable.

tomasr8 · May 4, 2023, 9:36pm

I’d be happy to submit PRs (and I guess @jack1142 as well) as long as someone’s willing to review them.

Would you happen to know if there’s someone currently maintaining i18n/gettext/locale (or anyone else) who we could contact?

barry · May 5, 2023, 12:44pm

There isn’t an i18n team for Python AFAIK. You can open tickets and submit PRs and tag me, but I’m not making any promises.

tomasr8 · May 5, 2023, 7:33pm

Alright, I’ll give it a go

olejorgenb · August 27, 2023, 9:37pm

Just adding weight to the case for either updating pygettext or updating the documentation to clearly point out the limitations. I spent a good 30 minutes scratching my head, trying to understand why my pgettext calls were not extracted by pygettext.

The possibility that the tool simply did not support the feature was not one I even considered.

While xgettext supports Python now, it needs --keyword=pgettext:1c,2 to extract pgettext calls.
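
For anyone else puzzled by the `1c,2` part: as I read the xgettext syntax, a number with a trailing `c` is the position of the context argument, the first plain number is the msgid and the second plain number is the plural form. A small sketch of that reading follows; `parse_keyword_spec` is a made-up helper, not part of any tool, and it ignores the rarer forms of the syntax:

```python
def parse_keyword_spec(spec):
    """Parse an xgettext-style keyword spec such as 'pgettext:1c,2'.

    Returns (function_name, positions), where positions maps
    'msgctxt', 'msgid' and 'msgid_plural' to 1-based argument numbers.
    Only the basic forms are handled in this sketch.
    """
    name, _, args = spec.partition(":")
    if not args:
        # A bare keyword means "the first argument is the msgid".
        return name, {"msgid": 1}
    positions = {}
    for part in args.split(","):
        part = part.strip()
        if part.endswith("c"):
            positions["msgctxt"] = int(part[:-1])
        elif "msgid" not in positions:
            positions["msgid"] = int(part)
        else:
            positions["msgid_plural"] = int(part)
    return name, positions


print(parse_keyword_spec("pgettext:1c,2"))
```

So `pgettext:1c,2` tells the extractor that in `pgettext("menu", "Open")` the first argument is the context and the second is the message.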

tomasr8 · August 30, 2023, 7:44am

Agreed, the way the documentation is written you’d never know that a vital part of what you’d expect from the tool is missing.

btw, if you’d like a Python alternative to xgettext, babel can extract pgettext (and others) by default.

maciek · February 2, 2025, 12:31pm

Is there a way to reuse the GNU gettext tests with pygettext to ensure feature parity and compatibility, now and in the future?

tomasr8 · February 3, 2025, 12:04pm

Hi!

In terms of compatibility, PO files are a well-defined format and the goal is for pygettext to produce output that is compatible with both xgettext and babel (which is also the case currently). Since I’m more familiar with babel, I was actually looking into reusing babel’s tests instead, but the bigger priority now is fixing the many bugs in pygettext.
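
To illustrate what “compatible output” means here, this is roughly the shape of a template entry that all three tools agree on. The snippet is only a sketch with made-up helper names (`po_escape`, `format_po_entry`), not pygettext’s actual writer, and it skips line wrapping, flags and the file header:

```python
def po_escape(text):
    """Escape a string for use inside a double-quoted PO string."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\t", "\\t")
    )


def format_po_entry(msgid, msgctxt=None, msgid_plural=None, locations=()):
    """Format one untranslated entry as it would appear in a .pot template."""
    lines = []
    if locations:
        # Source references, e.g. "#: app.py:4 other.py:10"
        lines.append("#: " + " ".join(f"{name}:{line}" for name, line in locations))
    if msgctxt is not None:
        lines.append(f'msgctxt "{po_escape(msgctxt)}"')
    lines.append(f'msgid "{po_escape(msgid)}"')
    if msgid_plural is not None:
        lines.append(f'msgid_plural "{po_escape(msgid_plural)}"')
        lines.append('msgstr[0] ""')
        lines.append('msgstr[1] ""')
    else:
        lines.append('msgstr ""')
    return "\n".join(lines) + "\n"


print(format_po_entry("Open", msgctxt="menu", locations=[("app.py", 4)]))
```

The `msgctxt` line is what a pgettext call produces, and `msgid_plural` with indexed `msgstr[n]` lines is what an ngettext call produces.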

When it comes to feature parity, the biggest discrepancy is in the CLI options which are much more limited in pygettext. For instance, there is no way to define custom keywords which is something I’d like to add. In fact, it is only relatively recently that pygettext can extract ngettext/pgettext/etc… at all.

For extraction itself, once pygettext switches to a parser, we’ll be able to fix, in theory, all extraction-related bugs, some of which are present even in xgettext/babel.