Accessibility of multilingual content with mixed translation

:waving_hand: Not 100% sure this is the right forum category, but here goes. I’d like the Python ecosystem to have more accessible multilingual content. Almost all Python projects I’m involved with use the gettext module and GNU gettext as the foundation of their translations (of any user interface, user docs, contributor docs). So far, all the projects I’ve done accessibility reviews of share the same issue(s), which seem to come either from a lack of gettext capabilities or from a lack of understanding of accessibility requirements.

The issue is – projects have content in mixed languages within a single web page, without annotating what language a given word or run of text is in (with the lang HTML attribute). This is a problem for users of assistive tech. For example, speech synthesizers use this information to pronounce words correctly. The words are unintelligible if they’re pronounced in the wrong language. It’s a clear accessibility fail, and also arguably an inclusivity issue, in that it only affects people who aren’t using the content’s source language.

This is described in the Web Content Accessibility Guidelines (WCAG) 3.1.2: Language of Parts (Level AA):

The human language of each passage or phrase in the content can be programmatically determined except for proper names, technical terms, words of indeterminate language, and words or phrases that have become part of the vernacular of the immediately surrounding text.

In addition to the issue for real-world users, failure to meet this aspect of WCAG also means falling short of legal requirements: for example, Section 508 for the federal sector in the USA, the European Accessibility Act for some of the private sector in the EU, and many more around the world.

Examples

First off if you want a good example, I’d recommend the WCAG 2.0 French translation, for example the CAPTCHA definition. But in addition, here are examples of real-world content with this issue across multiple Python projects:

With the warning that this is pretty cringe, here’s a recording of NVDA reading those French docs with mixed English on YouTube (thanks to Assistiv Labs for making this available for my project!)

What can be done

We tried to consider the options for Wagtail in early 2024 and didn’t get anywhere. We have more pressing accessibility issues to solve, but we still need to make a plan to address this one – hence why I’m here. I suspect there isn’t much specific to Wagtail here. The need is simple – add a lang attribute wherever needed. In practice this likely means:

  • Finding a way to detect whether for a given string, a translation is available or not.
  • If a translation exists, great, that translation matches the language of the overall page and there’s nothing further to do.
  • If the content is untranslated, determine the source language.
  • In that scenario, add a lang attribute on an HTML element around the string, with the source language as the attribute value.
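
The four steps above can be sketched in a few lines of Python. This is only an illustration, not how gettext, Django, or Wagtail work today: the function name `render_message`, its parameters, and the plain dict standing in for a translation catalog are all assumptions made for the example.

```python
from html import escape


def render_message(msgid, catalog, page_lang, source_lang="en"):
    """Return HTML for msgid, marking untranslated text with its source language.

    `catalog` is a plain dict of msgid -> translation (a stand-in for a real
    gettext catalog); all names here are hypothetical.
    """
    translated = catalog.get(msgid)
    if translated is not None:
        # A translation exists: it matches the page language, nothing to annotate.
        return escape(translated)
    if source_lang == page_lang:
        # The source text is already in the page language.
        return escape(msgid)
    # Untranslated: wrap the source text so assistive tech can switch language.
    return f'<span lang="{escape(source_lang)}">{escape(msgid)}</span>'


catalog_fr = {"Hello": "Bonjour"}
print(render_message("Hello", catalog_fr, page_lang="fr"))
print(render_message("Go to Wagtail admin", catalog_fr, page_lang="fr"))
```

The second call falls back to the English source string and wraps it in an element carrying lang="en", which is the annotation WCAG 3.1.2 asks for.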

Even assuming all of the above is possible, there are still some pretty challenging aspects:

  • This will bloat all UI code where content strings are sparse. For example Django templates with the {% translate %} template tag - imagine if every single use of that tag was preceded with a {% if %} and language check and output of a lang attribute.
  • This will mean a lot more forwarding of data from the Python code where translations are often defined (_() helper functions) to wherever the HTML is rendered, meaning a lot of code changes.

Anyway. Since this seems like such a prevalent problem I’d really like to see this addressed in Python directly rather than having to do a lot of research and devise workarounds for Django or Wagtail only. I’m not sure though if this would require changes to the gettext module, or it’s simply a matter of better official docs and community best practices. Or if this requires even bigger changes like a switch to MessageFormat 2 or similar more modern options.

But for now – I’d love to hear if others have thought about this / solved this, see examples of projects that might have solved this (in Python or other ecosystems), or just get feedback on whether people agree with my framing of the problem.


Thanks for bringing this up. It seems like an important problem, but I’m not entirely sure what specifically you are asking for insofar as the Python language itself is concerned.

As far as the stdlib gettext module goes, it seems to be mostly an issue for the rendering/output frontend extracting and injecting the appropriate tags or other indication per the format.

As far as the Python docs are concerned, solving this issue for the Python docs seems like it could be relatively straightforward, as Sphinx already knows what text is translated or not and adds the appropriate translated/untranslated class to the parent element (and this is reflected in the doctree attributes as well for use for other output formats), and could presumably be modified to output the lang as well (with a way to provide or infer the translation’s source/fallback language). Or, as a quick and dirty hack, this could be added in a few lines of custom JS at render time. @AA-Turner any insight here?


Regarding the lang attribute, I’ve also seen that used on code blocks for highlightjs and other code highlighters. Is that an issue?

Anyway, +1 to looking for ways to improve translations, but I’m not ultra clear on what needs to change in terms of tooling.

Thank you both for looking into this! tl;dr: essentially I’d like people with more gettext experience than me to review this and advise whether any of what I’m describing warrants changes in Python itself, or in how internationalization of Python projects is documented.

And consider the possible courses of action to:

  1. Fix this for a given project (for example the Python docs translations).
  2. Fix this / improve the status quo for the whole ecosystem of Python gettext users.

In both cases I’m not clear what possible solutions there are. Seeing this issue on every project I’ve ever reviewed makes me think there might be gaps in gettext capabilities, hence why I’m posting in this forum. But it could also just be documentation gaps, or simply a need for more awareness of accessibility across the community. I expect Python contributors with accessibility and translations expertise will be able to answer that. There are thousands of Python projects with this issue out there, so it would be great if there was a clear path to fixing it, and if, long-term, “doing it right from the start” became the path of least resistance.


I’ve not noticed those translated/untranslated classes, that sounds very promising!


@sirosen as far as I know lang is only intended for human languages (see the HTML spec for lang). I don’t know whether that kind of misuse is just a semantic issue or also one for users, but am happy to research this a bit if you can point me to a real-world example.


My comments are purely about gettext; I do not know about the docs, but I think Adam Turner recently implemented something like what you want in Sphinx.

As far as I know GNU gettext does not provide any such functionality.

For Python’s gettext, all translations are currently loaded into the _catalog (note that I am working on implementing GNU gettext’s hash table lookup, so this may change), which can be used to check whether a message actually has a translation.

For example:

import gettext

translation = gettext.translation(...)

msgid = 'Hello'
message = translation.gettext(msgid)
# _catalog is keyed by the source msgid, so look up the msgid, not the translated text
is_translated = msgid in translation._catalog
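
To make the catalog check reusable, it could be wrapped in a small helper that returns the text together with a flag. The sketch below is hedged in several ways: `gettext_with_status` and `DictTranslations` are hypothetical names, `DictTranslations` is only an in-memory stand-in so the example runs without a .mo file, and `_catalog` is a private attribute keyed by the source msgid that may change.

```python
import gettext


class DictTranslations(gettext.NullTranslations):
    """Hypothetical in-memory stand-in for a catalog loaded from a .mo file."""

    def __init__(self, catalog):
        super().__init__()
        self._catalog = dict(catalog)

    def gettext(self, message):
        return self._catalog.get(message, message)


def gettext_with_status(translation, msgid):
    """Return (text, is_translated) for msgid."""
    # _catalog is private and keyed by msgid; NullTranslations has no such
    # attribute, so default to an empty mapping in that case.
    catalog = getattr(translation, "_catalog", {})
    return translation.gettext(msgid), msgid in catalog


fr = DictTranslations({"Hello": "Bonjour"})
print(gettext_with_status(fr, "Hello"))
print(gettext_with_status(fr, "Goodbye"))
print(gettext_with_status(gettext.NullTranslations(), "Hello"))
```

Note that this only inspects the first translation object; a message resolved through a fallback added with add_fallback() would still be reported as untranslated by this sketch.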

I was sure that highlightjs (JS) and rouge (Ruby) did this, but I now see that it’s actually done with class="language-python", etc. Sorry about that; but at least there’s no worry on this front!


@sirosen no worries! @Stanfromireland thank you :star: – so clearly the capability is there, and this becomes a question of API design then, to retrieve this “is_translated” for every translated string, alongside the string?

Django has a lot of wrappers around gettext, but from what I understand it’s only gettext_lazy that has a wrapper object, and the output of translation.gettext("message") is still a plain string. So I assume we can’t have something like:

message = translation.gettext('Hello')
message.is_translated

…and instead would need a method to check the catalog after the fact? Feels tricky. Tricky in Python, but I’m even more worried about templates, where there aren’t as many opportunities to reuse variables.
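
For what it’s worth, an attribute like message.is_translated is possible in plain Python with a str subclass. The sketch below is an assumption-laden illustration, not an existing gettext or Django API: `TranslatedStr` and `gettext_marked` are hypothetical names, and a dict stands in for the catalog.

```python
class TranslatedStr(str):
    """A str that remembers whether it came from a translation catalog."""

    def __new__(cls, text, is_translated):
        obj = super().__new__(cls, text)
        obj.is_translated = is_translated
        return obj


def gettext_marked(catalog, msgid):
    """Look up msgid in a dict catalog (hypothetical stand-in for gettext)."""
    if msgid in catalog:
        return TranslatedStr(catalog[msgid], True)
    # No translation: return the source text, flagged as untranslated.
    return TranslatedStr(msgid, False)


message = gettext_marked({"Hello": "Bonjour"}, "Hello")
print(message, message.is_translated)

fallback = gettext_marked({}, "Hello")
print(fallback, fallback.is_translated)
```

The catch is that the flag is fragile: any string operation, such as concatenation or formatting, returns a plain str and silently drops is_translated, so the check has to happen before the value is transformed.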

Real-world example from the Wagtail 404 page:

<a class="page404__button button" href="{% url 'wagtailadmin_home' %}">{% trans "Go to Wagtail admin" %}</a>

This would need to become either of:

<!-- Note how the starting tag isn’t closing. Grmbl. -->
<a class="page404__button button" href="{% url 'wagtailadmin_home' %}" {% trans_with_lang_attr "Go to Wagtail admin" %}</a>

Or:

<!-- Pretty verbose -->
{% trans "Go to Wagtail admin" as go_to_label %}
<a class="page404__button button" href="{% url 'wagtailadmin_home' %}" {% if go_to_label|is_untranslated %}lang="en" {% endif %}>{{ go_to_label }}</a>

I hope others have better ideas… those options seem achievable but also pretty far from the “path of least resistance” I was hoping for.

This sounds like the right sort of approach, if Sphinx can add lang=<the default, usually "en"> to the untranslated elements, somewhere around here:


I have mixed feelings about this being added, though it is simple enough to implement. The _catalog could be made public.

:+1: I have no opinions about the API personally. For Django there are already wrappers around gettext so I assume we could ship our own API one way or the other. For users of vanilla gettext I assume the API and how it’s documented will make a big difference on adoption.


If I can find the time, I’ll try to find different open source projects affected by this, if more examples can help. As far as I know, none of the Django ecosystem does this correctly. CKAN and Jupyter also seem to have this issue. If anyone has suggestions for other large parts of the Python ecosystem that have multilingual UIs in HTML, please let me know.