Accessibility of multilingual content with mixed translation

thibaudcolas · April 27, 2025, 11:28pm

Not 100% sure this is the right forum category but here goes. I’d like the Python ecosystem to have more accessible multilingual content. Almost all Python projects I’m involved with use the gettext module and GNU gettext as the foundation of their translations (of any user interface, user docs, contributor docs). And so far all the projects I’ve done accessibility reviews of share the same issue(s), which seem to come either from a lack of gettext capabilities, or a lack of understanding of accessibility requirements.

The issue is – projects have content in mixed languages within the one web page, without annotating what language a given word or run of text is (with the lang HTML attribute). This is a problem for users of assistive tech. For example, speech synthesizers use this information to correctly pronounce words. The words are unintelligible if they’re pronounced in the wrong language. It’s a clear accessibility fail, and also arguably an inclusivity issue in that this only affects people who aren’t using the content’s source language.

This is described in the Web Content Accessibility Guidelines (WCAG) 3.1.2: Language of Parts (Level AA):

The human language of each passage or phrase in the content can be programmatically determined except for proper names, technical terms, words of indeterminate language, and words or phrases that have become part of the vernacular of the immediately surrounding text.

In addition to the issue for real-world users, failure to meet this aspect of WCAG also means falling short of legal requirements: for example, Section 508 for the federal sector in the USA and the European Accessibility Act for some of the private sector in the EU, among many others around the world.

Examples

First off if you want a good example, I’d recommend the WCAG 2.0 French translation, for example the CAPTCHA definition. But in addition, here are examples of real-world content with this issue across multiple Python projects:

Python 3.14 docs Glossary in French, very first item, The default Python prompt of […].
PyPI homepage, in the footer in French, “PyPI”, “Python Package Index”, and the Blocks logos […].
Django French docs - How to customize the shell command. Everything but the “Documentation” heading is in the wrong language.
Choosing a build backend - in the Japanese Python Packaging User Guide, The requires key is a list of packages […].

With the warning that this is pretty cringe, here’s a recording on YouTube of NVDA reading those French docs with mixed English (thanks to Assistiv Labs for making this available for my project!).

What can be done

We tried to consider the options for Wagtail in early 2024 and didn’t get anywhere. We have more pressing accessibility issues to solve, but we still need to make a plan to address this one – hence why I’m here. I suspect there isn’t much specific to Wagtail here. The need is simple – add a lang attribute wherever needed. In practice this likely means:

Finding a way to detect whether for a given string, a translation is available or not.
If a translation exists, great, that translation matches the language of the overall page and there’s nothing further to do.
If the content is untranslated, determine the source language.
In that scenario, add a lang attribute on an HTML element around the string, with the source language as the attribute value.
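The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalog is modelled as a plain dict of source string to translation, the source language is assumed to be "en", and render_message is a hypothetical helper, not part of gettext or Django.

```python
from html import escape


def render_message(msgid, catalog, source_lang="en"):
    """Return HTML for msgid, marking the source language when untranslated."""
    translated = catalog.get(msgid)
    if translated:
        # A translation exists and matches the page language: nothing to annotate.
        return escape(translated)
    # No translation: output the source string, wrapped with its language.
    return f'<span lang="{escape(source_lang)}">{escape(msgid)}</span>'


french = {"Hello": "Bonjour"}  # stand-in for a real translation catalog
print(render_message("Hello", french))                # Bonjour
print(render_message("Go to Wagtail admin", french))  # <span lang="en">Go to Wagtail admin</span>
```

The hard parts in real projects are exactly the two assumptions made here: getting at the catalog, and knowing the source language.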

Even assuming all of the above is possible, there are still some pretty challenging aspects:

This will bloat all UI code where content strings are sparse. For example Django templates with the {% translate %} template tag - imagine if every single use of that tag was preceded with a {% if %}, a language check, and the output of a lang attribute.
This will mean a lot more forwarding of data from the Python code where translations are often defined (_() helper functions) to wherever the HTML is generated, meaning a lot of code changes.

Anyway. Since this seems like such a prevalent problem I’d really like to see this addressed in Python directly rather than having to do a lot of research and devise workarounds for Django or Wagtail only. I’m not sure though if this would require changes to the gettext module, or it’s simply a matter of better official docs and community best practices. Or if this requires even bigger changes like a switch to MessageFormat 2 or similar more modern options.

But for now – I’d love to hear if others have thought about this / solved this, see examples of projects that might have solved this (in Python or other ecosystems), or just get feedback on whether people agree with my framing of the problem.

CAM-Gerlach · April 28, 2025, 3:27am

Thanks for bringing this up. It seems like an important problem, but I’m not entirely sure what specifically you are asking for insofar as the Python language itself is concerned.

As far as the stdlib gettext module goes, it seems to be mostly an issue for the rendering/output frontend extracting and injecting the appropriate tags or other indication per the format.

As far as the Python docs are concerned, solving this issue for the Python docs seems like it could be relatively straightforward, as Sphinx already knows what text is translated or not and adds the appropriate translated/untranslated class to the parent element (and this is reflected in the doctree attributes as well for use for other output formats), and could presumably be modified to output the lang as well (with a way to provide or infer the translation’s source/fallback language). Or, as a quick and dirty hack, this could be added in a few lines of custom JS at render time. @AA-Turner any insight here?
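For illustration, the quick-and-dirty variant could also run as a post-processing step in Python instead of JS. The sketch below uses only the standard library html.parser and assumes the rendered HTML already carries the untranslated class and that the source language is "en"; LangAnnotator and annotate are hypothetical names, and a proper fix would live in Sphinx itself.

```python
from html import escape
from html.parser import HTMLParser


class LangAnnotator(HTMLParser):
    """Re-emit HTML, adding lang to elements that carry the 'untranslated' class."""

    def __init__(self, source_lang="en"):
        # Keep character references untouched so they are re-emitted verbatim.
        super().__init__(convert_charrefs=False)
        self.source_lang = source_lang
        self.out = []

    def _emit_start(self, tag, attrs, closing):
        attrs = list(attrs)
        names = {name for name, _ in attrs}
        classes = (dict(attrs).get("class") or "").split()
        if "untranslated" in classes and "lang" not in names:
            attrs.append(("lang", self.source_lang))
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}{closing}")

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def annotate(html_text, source_lang="en"):
    parser = LangAnnotator(source_lang)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Elements that already declare a lang attribute are left alone, so the pass is safe to run more than once.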

sirosen · April 28, 2025, 3:50am

Regarding the lang attribute, I’ve also seen that used on code blocks for highlightjs and other code highlighters. Is that an issue?

Anyway, +1 to looking for ways to improve translations, but I’m not ultra clear on what needs to change in terms of tooling.

thibaudcolas · April 28, 2025, 9:28am

Thank you both for looking into this! tl;dr: essentially I’d like people with more gettext experience than me to review this and advise if any of what I’m describing warrants changes in Python itself, or in how internationalization of Python projects is documented.

And consider the possible courses of action to:

Fix this for a given project (for example the Python docs translations).
Fix this / improve the status quo for the whole ecosystem of Python gettext users.

In both cases I’m not clear what possible solutions there are. Seeing this issue on every project I’ve ever reviewed makes me think there might be gaps in gettext capabilities, hence why I’m posting in this forum. But it could also just be documentation gaps, or simply a need for more awareness of accessibility across the community. I expect Python contributors with accessibility and translations expertise will be able to answer that. There are thousands of Python projects with this issue out there, so it would be great if there was a clear path to fixing it, and if, long-term, “doing it right from the start” became the path of least resistance.

I’ve not noticed those translated/untranslated classes, that sounds very promising!

@sirosen as far as I know lang is only intended for human languages (see the HTML spec for lang). I don’t know whether that kind of misuse is just a semantic issue or also one for users, but am happy to research this a bit if you can point me to a real-world example.

Stanfromireland · April 28, 2025, 4:42pm

My comments are purely about gettext. I do not know about the docs, but I think Adam Turner recently implemented something like what you want in Sphinx.

As far as I know GNU gettext does not provide any such functionality.

For Python’s gettext, all translations are currently loaded into the _catalog (note that I am working on implementing GNU gettext’s hash table lookup, so this may change), which can be used to check if a message actually has a translation.

For example:

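import gettext

translation = gettext.translation(...)

msgid = 'Hello'
message = translation.gettext(msgid)
# _catalog is keyed by the source message (msgid), not by the translated text,
# so the membership test has to use the msgid.
is_translated = msgid in translation._catalog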
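To try this check without a compiled catalog on disk, here is a self-contained sketch that builds a tiny .mo file in memory and loads it with gettext.GNUTranslations. build_mo and is_translated are hypothetical helpers written for this example. The lookup uses the source string because the catalog is keyed by msgid, and it relies on the private _catalog attribute, which may change as noted above.

```python
import gettext
import io
import struct


def build_mo(messages):
    """Serialize {msgid: msgstr} into a minimal little-endian GNU .mo file."""
    items = sorted((k.encode("utf-8"), v.encode("utf-8")) for k, v in messages.items())
    count = len(items)
    header_size = 7 * 4                 # seven uint32 header fields
    ids_table = header_size             # (length, offset) pairs for msgids
    strs_table = ids_table + count * 8  # (length, offset) pairs for msgstrs
    offset = strs_table + count * 8     # string data starts after both tables

    id_entries, str_entries, data = [], [], b""
    for msgid, _ in items:
        id_entries += [len(msgid), offset + len(data)]
        data += msgid + b"\0"
    for _, msgstr in items:
        str_entries += [len(msgstr), offset + len(data)]
        data += msgstr + b"\0"

    # Magic number, format revision 0, string count, table offsets, no hash table.
    header = struct.pack("<7I", 0x950412DE, 0, count, ids_table, strs_table, 0, 0)
    tables = struct.pack(f"<{4 * count}I", *id_entries, *str_entries)
    return header + tables + data


catalog_bytes = build_mo({
    "": "Content-Type: text/plain; charset=UTF-8\n",  # header entry declaring the charset
    "Hello": "Bonjour",
})
translation = gettext.GNUTranslations(io.BytesIO(catalog_bytes))


def is_translated(translation, msgid):
    # Private API: _catalog maps each msgid to its translation.
    return msgid in translation._catalog


print(translation.gettext("Hello"), is_translated(translation, "Hello"))
print(translation.gettext("Goodbye"), is_translated(translation, "Goodbye"))
```

Note that gettext.translation(..., fallback=True) can return a NullTranslations instance, which has no _catalog at all, so real code would need to guard for that case.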

sirosen · April 28, 2025, 5:32pm

I was sure that highlightjs (JS) and rouge (Ruby) did this, but I now see that it’s actually done with class="language-python", etc. Sorry about that; but at least there’s no worry on this front!

thibaudcolas · April 30, 2025, 6:48am

@sirosen no worries! @Stanfromireland thank you – so clearly the capability is there, and this becomes a question of API design then, to retrieve this “is_translated” for every translated string, alongside the string?

Django has a lot of wrappers around gettext but from what I understand it’s only gettext_lazy that has a wrapper object, and the output of translation.gettext("message") is still a plain string. So I assume we can’t have something like:

message = translation.gettext('Hello')
message.is_translated

…and instead would need a method to check the catalog after the fact? Feels tricky. Tricky in Python, but I’m even more worried about templates, where there aren’t as many opportunities to reuse variables.
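One possible way to get something close to message.is_translated with public API only is a fallback translations object that returns a str subclass. This is a sketch, not an existing gettext or Django feature: Untranslated, MarkUntranslated and DictTranslations are hypothetical names, DictTranslations stands in for a real loaded catalog, and "en" is an assumed source language. Only NullTranslations and add_fallback come from the gettext module.

```python
import gettext


class Untranslated(str):
    """A str that remembers it is still in the source language."""

    lang = "en"  # assumed source language


class MarkUntranslated(gettext.NullTranslations):
    """Fallback of last resort: returns the msgid, tagged as untranslated."""

    def gettext(self, message):
        return Untranslated(message)


class DictTranslations(gettext.NullTranslations):
    """Hypothetical stand-in for a catalog loaded from a .mo file."""

    def __init__(self, messages):
        super().__init__()
        self._messages = messages

    def gettext(self, message):
        if message in self._messages:
            return self._messages[message]
        # NullTranslations.gettext delegates to the fallback when one is set.
        return super().gettext(message)


french = DictTranslations({"Hello": "Bonjour"})
french.add_fallback(MarkUntranslated())

hello = french.gettext("Hello")
label = french.gettext("Go to Wagtail admin")

print(isinstance(hello, Untranslated))  # False
print(isinstance(label, Untranslated))  # True
print(label.lang)                       # en
```

The marker is fragile: most string operations (formatting, concatenation, slicing) return a plain str and drop it, so it would have to be checked right where the message is looked up.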

Real-world example from the Wagtail 404 page:

<a class="page404__button button" href="{% url 'wagtailadmin_home' %}">{% trans "Go to Wagtail admin" %}</a>

This would need to become either of:

<!-- Note how the starting tag isn’t closing. Grmbl. -->
<a class="page404__button button" href="{% url 'wagtailadmin_home' %}" {% trans_with_lang_attr "Go to Wagtail admin" %}</a>

Or:

<!-- Pretty verbose -->
{% trans "Go to Wagtail admin" as go_to_label %}
<a class="page404__button button" href="{% url 'wagtailadmin_home' %}" {% if go_to_label|is_untranslated %}lang="en" {% endif %}>{{ go_to_label }}</a>

I hope others have better ideas… those options seem achievable but also pretty far from the “path of least resistance” I was hoping for.

hugovk · April 30, 2025, 7:19am

This sounds like the right sort of approach, if Sphinx can add lang=<the default, usually "en"> to the untranslated elements, somewhere around here:

github.com/sphinx-doc/sphinx, sphinx/transforms/i18n.py at c4929d026:

for node in NodeMatcher(nodes.Element, translated=Any).findall(self.document):
    if node['translated']:
        if add_translated:
            node.setdefault('classes', []).append('translated')  # type: ignore[arg-type]
    else:
        if add_untranslated:
            node.setdefault('classes', []).append('untranslated')  # type: ignore[arg-type]

Stanfromireland · May 1, 2025, 2:57pm

I have mixed feelings about this being added, though it is simple enough to implement. The _catalog could be made public.

thibaudcolas · May 2, 2025, 10:25am

I have no opinions about the API personally. For Django there are already wrappers around gettext so I assume we could ship our own API one way or the other. For users of vanilla gettext I assume the API and how it’s documented will make a big difference on adoption.

If I can find the time, I’ll look for more open source projects affected by this, in case more examples help. As far as I know, none of the Django ecosystem does this correctly. CKAN and Jupyter also seem to have this issue. If anyone has suggestions for other large parts of the Python ecosystem that have multilingual UIs in HTML, please let me know.

thibaudcolas · August 16, 2025, 11:54am

Update: after discussing this with @AA-Turner, I’ve opened an issue in Sphinx: lang attribute set to source language for untranslated text #13841. I tried to have a go at fixing this myself but there’s no straightforward way that I could find to infer the source language.

It would be great if “get source language” was part of the gettext API per message, but from what I see of the discussions here it seems more likely to be something that all the different translation implementations will need to reinvent? As in introduce a source_language = "en" in the Sphinx configuration?

If someone wants to pick this up please go for it, I don’t see myself having time to work further on this in Sphinx in the next 6 months.

For CPython

For CPython, the above would need to happen to fix the issue with the docs’ content. But then there is also a need for similar adaptations for anything in the docs’ UI (anything that uses {% trans %} in the theme’s HTML).

And same for the rest of the Python ecosystem, there’s no clear way to fix this for any of the projects marking their text as translatable within HTML templates.

Alternative: machine translations

For my projects, I’m more and more considering whether the best option might be to move from “partial translations with English fallbacks” to “partial human translations with machine-translated fallbacks”. This approach would solve the problem as technically 100% of the text would be in the target language, but then there’s a separate challenge of making sure the machine-translated fallbacks are good.

I have a feeling machine-translated fallbacks are more useful to people than English that they might not understand at all, but I’m not sure.

MRAB · August 16, 2025, 3:22pm

If machine-translated fallbacks were used, should they be marked as such, as an “unchecked” translation?

thibaudcolas · August 27, 2025, 5:27pm

Honestly, no idea. I’ve heard others are doing this (CPython’s Turkish translations team?) but have little experience myself. It’d be great to see more projects trying that. I’ll try to give it a go myself and see what happens.

Stanfromireland · August 27, 2025, 7:11pm

The Turkish team does have machine-translated messages, but does not publish them in that state (this is against our translation style guide); they only publish messages reviewed by humans. Machine translations are often inaccurate.