Modernize and add missing features to pygettext

Python has a few helpful i18n tools in the Tools/i18n directory. One of them is pygettext, which extracts translatable strings from source files. This tool is also mentioned in the gettext documentation.

Unfortunately, nowadays the feature set of pygettext is severely lacking compared to other widely used tools such as pybabel and GNU’s xgettext.

Specifically, it lacks support for ngettext, pgettext and similar functions, for extracting programmer comments, for message flags, and for custom marking keywords with multiple arguments, among other things. This makes the tool unusable in many projects which rely on this functionality.
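To make the gap concrete, here is a small sketch of the kinds of calls in question, using only the standard gettext module. The variable names and the `describe` helper are made up for illustration; `NullTranslations` is used so the snippet runs without any message catalog.

```python
import gettext

# NullTranslations returns the original strings unchanged,
# so no .mo catalog is needed to run this.
t = gettext.NullTranslations()
_ = t.gettext


def describe(n):
    # Plural-aware lookup: singular msgid, plural msgid, count.
    return t.ngettext("%d file", "%d files", n) % n


# Plain gettext call, marked with the conventional `_` keyword.
title = _("Open")

# Context-aware lookup: the first argument is the context (msgctxt),
# the second one is the message itself.
menu_item = t.pgettext("menu", "Open")
```

An extractor that only understands single-argument keywords cannot record the plural form or the context of the last two kinds of call, which is the limitation described above.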

There has been some talk about this in the last few years:

pygettext itself says:

“Python equivalent of xgettext(1)”

pygettext attempts to be option and feature compatible with GNU
xgettext where ever possible.

Given the state of the tool and what it aims to achieve, I propose to modernize it in order to be more in line with other tools that are used nowadays.

This would (could) entail:

  • Expanding the default keyword support to ngettext, pgettext and others
  • Allowing custom marking keywords with multiple arguments (using funcname:1c,2,3 syntax)
  • Extracting programmer comments (there’s an old PR that implements it: python/cpython issue #42361, “pygettext: extract translators comments”)
  • Supporting python-format and python-brace-format flags for better checks of msgstr
  • Better text wrapping support
  • Using argparse instead of getopt for better argument checking & error messages
  • Using the AST parser instead of the current tokenizer-based approach - this will greatly simplify the code and make it more robust
  • Updating and adding tests
  • Adding more CLI options to better align with the functionality of pybabel and xgettext
  • Documenting the CLI options
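As a rough illustration of the AST-based approach from the list above, here is a minimal sketch. It is not the actual pygettext code or the proposed implementation: the `KEYWORDS` table, the helper names and the returned dictionary shape are all assumptions made up for this example, and argument positions are 0-based here for simplicity.

```python
import ast

# Hypothetical keyword table: function name -> role of each positional
# argument (0-based index). Not taken from pygettext itself.
KEYWORDS = {
    "_": {"msgid": 0},
    "gettext": {"msgid": 0},
    "ngettext": {"msgid": 0, "msgid_plural": 1},
    "pgettext": {"msgctxt": 0, "msgid": 1},
}


def _func_name(node):
    # Accept both bare names (`_("x")`) and attributes (`t.gettext("x")`).
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return node.attr
    return None


def extract_messages(source):
    """Return one dict per translatable call found in `source`."""
    messages = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        spec = KEYWORDS.get(_func_name(node.func))
        if spec is None:
            continue
        entry = {"lineno": node.lineno}
        for role, index in spec.items():
            if index >= len(node.args):
                break  # too few arguments, skip this call
            arg = node.args[index]
            if not (isinstance(arg, ast.Constant) and isinstance(arg.value, str)):
                break  # only literal strings can be extracted
            entry[role] = arg.value
        else:
            messages.append(entry)
    # ast.walk does not guarantee source order, so sort by line number.
    return sorted(messages, key=lambda m: m["lineno"])
```

For example, `extract_messages('t.pgettext("menu", "Open")')` would yield a single entry carrying both the context and the message. Features such as translator comments, format flags and text wrapping are deliberately left out of this sketch.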

Of course, the list is by no means complete and not all of the points are necessary in order to improve pygettext. I deal with i18n in my job so having an up-to-date tool in Python itself would be great in any case.

One more point which was raised here is making the i18n scripts easier to access. One option was making gettext a module and accessing the scripts via python -m analogous to json.tool.

I was wondering what the community thinks about this proposal. I’d be happy to take the lead on this if we agree that it is worthwhile :slight_smile:

I could bring these over from redgettext (Cog-Creators/redgettext on GitHub, a slightly modified version of pygettext for Red-DiscordBot, https://github.com/Cog-Creators/Red-DiscordBot; see also the open PR). That repo is under GPL, but I’m the author of the AST rewrite, the translator comments handling, and the support for more keywords, so I’m free to license those under CPython’s license. It will need slight adjustments for sure, but it would be a good start.

Had a quick look at the code and it looks pretty similar to a small POC I hacked together over the weekend :smiley: Definitely a good starting point!

If we end up going through with this, the best approach would probably be to start with the AST parser and update/expand the tests at the same time. After that, it’ll be much easier to add any missing features.

Is there a reason to keep pygettext around? I’ve been out of the i18n game for a long while so I don’t know what the current state in and out of Python is. For historical context, I wrote the original pygettext because xgettext didn’t support Python, and I needed it to add translations to GNU Mailman. It’s been practically unmaintained for years, and I have no personal interest in working on it any more.

Maybe it’s time we removed all of Tools/i18n from the CPython source [1]? If others are interested in maintaining any of the scripts in that directory, we could split them out into their own repos, and release them separately.


  1. though I’d propose waiting until 3.13

In its current state, probably not. However, I’d argue that message extraction is an important part of i18n, which Python already supports through the gettext module. Having pygettext in the toolbox is therefore in the spirit of ‘batteries included’, as it would allow one to localize applications without the need for any additional libraries.

It’s also true that xgettext now supports Python, but xgettext is not available on every platform (Windows, for example), so that’s not always a viable alternative.

pygettext (and msgfmt) is also recommended in the documentation, but I don’t know if that’s reason enough in and of itself…

Overall though I think that the effort required to improve pygettext is small enough to justify keeping it around.

If so, perhaps others will provide PRs and reviews to keep it viable.

I’d be happy to submit PRs (and I guess @jack1142 as well) as long as someone’s willing to review them :grin:

Would you happen to know if there’s someone currently maintaining i18n/gettext/locale (or anyone else) who we could contact?

There isn’t an i18n team for Python AFAIK. You can open tickets and submit PRs and tag me but I’m not making any promises :stuck_out_tongue_winking_eye:

Alright, I’ll give it a go :smile::crossed_fingers:

Just adding weight to either updating pygettext or updating the documentation to clearly point out the limitations. I scratched my head for a good 30 minutes trying to understand why my pgettext calls were not extracted by pygettext.

The possibility that the tool simply did not support the feature was not an option I even considered :slight_smile:

While xgettext supports Python now, it needs --keyword=pgettext:1c,2 to extract pgettext calls.
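For readers unfamiliar with that keyword syntax: in `pgettext:1c,2`, the `c` suffix marks the (1-based) argument holding the context and the plain number marks the message itself. A minimal sketch of how such a spec could be parsed follows. The function name and the returned mapping are made up for this example, and only the context suffix and plain positions are handled, not the rest of xgettext’s keyword syntax.

```python
def parse_keyword_spec(spec):
    """Split e.g. 'pgettext:1c,2' into ('pgettext', {'msgctxt': 1, 'msgid': 2}).

    Positions are 1-based, as in the xgettext spec. Hypothetical helper,
    not part of pygettext or xgettext.
    """
    name, sep, args = spec.partition(":")
    if not sep:
        # A bare keyword means: the first argument is the msgid.
        return name, {"msgid": 1}
    roles = {}
    for part in args.split(","):
        part = part.strip()
        if part.endswith("c"):
            # 'Nc' marks the context argument.
            roles["msgctxt"] = int(part[:-1])
        elif "msgid" not in roles:
            roles["msgid"] = int(part)
        elif "msgid_plural" not in roles:
            roles["msgid_plural"] = int(part)
        else:
            raise ValueError(f"too many arguments in keyword spec: {spec!r}")
    return name, roles
```

With this sketch, `parse_keyword_spec("ngettext:1,2")` gives `("ngettext", {"msgid": 1, "msgid_plural": 2})`.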

Agreed. The way the documentation is written, you’d never know that a vital part of what you’d expect from the tool is missing.

btw, if you’d like a Python alternative to xgettext, Babel can extract pgettext (and others) by default :slight_smile: