Unicode identifiers as python entrypoints

julienmalardadam · April 8, 2025, 1:57pm

Unicode can be used for Python identifiers and filenames (PEP 3131 – Supporting Non-ASCII Identifiers | peps.python.org), but such names currently do not work as console scripts, whether as command names or as code file paths.

For example:

[console_scripts]
நான் = ஓர்.ஒருங்குறி:கட்டளை

Proposed solution

I propose that the packaging standard be updated to allow all valid PEP 3131 identifiers in console script file paths and command names.

steve.dower · April 8, 2025, 2:34pm

This is more likely to be a bug in either the installer or the build backend than in a specification. At a guess, the file is opened with open() using the default encoding, which on Windows assumes the current code page (for compatibility reasons), and so since the file has come from another machine, it’s probably being decoded incorrectly. Requiring UTF-8 encoding would be the answer, but chances are whichever code is trying to read it can simply try both and succeed virtually all the time.
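The “try both” idea can be sketched as follows. This is only a minimal illustration, not code from any installer or build backend; decode_entry_points and its fallback_encoding parameter are hypothetical names.

```python
import locale


def decode_entry_points(data, fallback_encoding=None):
    """Decode the raw bytes of an entry points file (hypothetical helper).

    UTF-8 is tried first; only if the bytes are not valid UTF-8 is the
    fallback encoding used (by default, the locale's preferred encoding).
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        if fallback_encoding is None:
            fallback_encoding = locale.getpreferredencoding(False)
        return data.decode(fallback_encoding)


# Valid UTF-8 input is returned unchanged, whatever the local code page is.
text = decode_entry_points("[console_scripts]\nநான் = ஓர்.ஒருங்குறி:கட்டளை\n".encode("utf-8"))
print(text)
```

Trying UTF-8 first is the safer order, because text written in a legacy code page is rarely valid UTF-8 by accident, whereas UTF-8 bytes often decode "successfully" (but wrongly) under a single-byte code page.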

We also can’t require something that isn’t supported by whatever file system a user happens to be using, so it’s possible that that’s the constraint.

Do you have any error messages showing how it actually fails? That will help identify what needs to be updated.

pf_moore · April 8, 2025, 2:58pm

The relevant standard is here.

The file is defined to be UTF-8, so that’s not an issue. Group names are constrained to be “groups of letters, numbers and underscores, separated by dots (regex ^\w+(\.\w+)*$)” but entry point names and object references aren’t constrained (object references must be valid Python identifiers, but that’s a language definition, not a packaging restriction).
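As a rough check that the file format itself copes with such names, here is a sketch that parses the example from the first post with the stdlib configparser. The parser settings (case-sensitive keys, = as the only delimiter, no interpolation) are assumptions about how a tool might configure it, not a description of what pip or installer actually do.

```python
from configparser import ConfigParser

# The example from the original post, as it would appear in entry_points.txt.
ENTRY_POINTS_TXT = """\
[console_scripts]
நான் = ஓர்.ஒருங்குறி:கட்டளை
"""

# Only "=" separates name and value, so the ":" in the object reference
# is kept as part of the value.
parser = ConfigParser(delimiters=("=",), interpolation=None)
parser.optionxform = str  # keep entry point names case-sensitive
parser.read_string(ENTRY_POINTS_TXT)

console_scripts = dict(parser["console_scripts"])
print(console_scripts)
```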

So if the example you propose doesn’t work, it’s likely just a bug in an application. Pip uses either importlib.metadata or pkg_resources to get entry point data, so if it’s happening with pip, I’d assume it’s one of those two implementations (possibly both, although my suspicion would be that it’s pkg_resources, just because that’s a lot older, and was developed before Unicode awareness was as prevalent as it is now).

julienmalardadam · April 8, 2025, 2:59pm

Thanks for the feedback! It’s in particular a bug with Python installer ([DO NOT MERGE!] Fix unicode entrypoint issue by julienmalard · Pull Request #254 · pypa/installer · GitHub), but they wanted some community feedback before implementing a fix. It seems that the issue there is mostly a problem with the re module ([DO NOT MERGE!] Fix unicode entrypoint issue by julienmalard · Pull Request #254 · pypa/installer · GitHub).

pf_moore · April 8, 2025, 3:35pm

FWIW, I have confirmed that importlib.metadata handles the entry point you gave correctly:

>>> from importlib.metadata import entry_points
>>> next(iter(entry_points()))
EntryPoint(name='நான்', value='ஓர்.ஒருங்குறி:கட்டளை', group='console_scripts')

I don’t know what installer might be doing (as @steve.dower said, we really need to see an actual error to understand what the problem is here) but the regex you’re claiming is a problem is present in importlib.metadata and it doesn’t cause problems there, so I think your diagnosis is wrong.

Actually, the re module handles Unicode just fine:

>>> import re
>>> re.match(r"\w", 'நான்')
<re.Match object; span=(0, 1), match='ந'>

julienmalardadam · April 8, 2025, 3:55pm

So perhaps importlib.metadata does handle this correctly. Here’s the failing test from installer: [DO NOT MERGE!] Fix unicode entrypoint issue · pypa/installer@660cf99 · GitHub, which is fixed by replacing the regex.
The tests fail because the \w regex does not handle Indic diacritics: matching \w+ against நான் does not cover the entire word, and the same is true for வணக்கம். The third-party regex module handles this case, but that is not a solution for low-level packages such as installer that can’t have other modules as dependencies.
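The behaviour can be reproduced with the standard library alone. The sketch below compares \w+ from the stdlib re module against str.isidentifier() for the two words above and prints each character's Unicode general category; fully_matches_word is a hypothetical helper name used only for this illustration.

```python
import re
import unicodedata


def fully_matches_word(text):
    # True only if stdlib re's \w+ covers the whole string.
    return re.fullmatch(r"\w+", text) is not None


for word in ("நான்", "வணக்கம்"):
    print(word,
          "| \\w+ covers whole word:", fully_matches_word(word),
          "| isidentifier():", word.isidentifier())
    for ch in word:
        # Combining marks have a general category starting with "M" (Mn, Mc);
        # these are the characters that \w in stdlib re does not match.
        print(f"  U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```

Both words are valid Python identifiers, yet neither is fully covered by \w+, because the vowel signs and the virama are combining marks rather than letters.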

pf_moore · April 8, 2025, 4:13pm

OK. At this point I’m out of my depth regarding Unicode (I’ve no idea what the status of Indic diacritics is). I guess it is a bug (in importlib.metadata in the stdlib, the backport importlib_metadata, installer, and pkg_resources, all of which use the same regex as far as I can tell).

So if you want, you could ask for all of those places to be fixed. It’s probably not worth just getting some fixed, as you’ll just end up with bugs in other places. (And there’s still the problem that Steve mentioned, which is that there’s no guarantee that such names will be usable as executable names on all platforms).

But it is true that the data you’re using is standards-compliant. If that’s all you want to know, then you have your answer.

MegaIng · April 8, 2025, 9:11pm

Why would this be a bug in re instead of an incorrect regex?

A nonspacing mark is not in any way an alphanumeric symbol. Instead, the regex should be updated to be more permissive. I believe the third-party regex module is in the wrong here and shouldn’t have made this change. Yes, it’s more “useful”, but it also fails to follow either Unicode [1] or Python [2]. The exact regex for identifiers as defined by Python or by Unicode is difficult to express in stdlib’s re. It’s possible, but still unwieldy in regex.

If the standards actually place no restriction on these keys, then neither should any of the libraries (outside of those restrictions that are technically unavoidable because of having to deal with file systems). If there are security concerns here, they should be documented in a standard.

[1] To follow Unicode, the character would need to have the Other_Alphabetic property.
[2] Python tightly defines \w in terms of str.isalnum.

pf_moore · April 8, 2025, 10:17pm

The exact quotes from the spec are

Group names must be one or more groups of letters, numbers and underscores, separated by dots (regex ^\w+(\.\w+)*$).
The name may contain any characters except =, but it cannot start or end with any whitespace character, or start with [. For new entry points, it is recommended to use only letters, numbers, underscores, dots and dashes (regex [\w.-]+).
The object reference points to a Python object. It is either in the form importable.module, or importable.module:object.attr. Each of the parts delimited by dots and the colon is a valid Python identifier.

For console_scripts and gui_scripts there are two further restrictions (the first of which can’t realistically be enforced by parsers):

In both groups, the name of the entry point should be usable as a command in a system shell after the package is installed.
As files are created from the names, and some filesystems are case-insensitive, packages should avoid using names in these groups which differ only in case. The behaviour of install tools when names differ only in case is undefined.

The key thing is that the object reference is defined in terms of Python identifiers, not regexes, or even Unicode properties. If Python were to change what’s allowable in an identifier, the rules here would change, too. The full specification of what’s a valid identifier is in the language spec.
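To make the distinction concrete, here is a minimal sketch of checks written directly from the quoted rules rather than from a regex. Both function names are hypothetical, and rejecting empty names is an assumption the quoted text does not state explicitly.

```python
def is_valid_entry_point_name(name):
    """Name rule from the quote: no "=", no leading or trailing
    whitespace, and no leading "[". Empty names are rejected too
    (an assumption; the quoted text does not mention them)."""
    if not name or "=" in name:
        return False
    if name != name.strip():
        return False
    return not name.startswith("[")


def is_valid_object_reference(ref):
    """Accept importable.module or importable.module:object.attr,
    where every dotted part is a valid Python identifier."""
    module, colon, attrs = ref.partition(":")
    parts = module.split(".")
    if colon:
        parts += attrs.split(".")
    # "".isidentifier() is False, so empty parts ("a..b", "a:") are rejected.
    return all(part.isidentifier() for part in parts)


print(is_valid_entry_point_name("நான்"))
print(is_valid_object_reference("ஓர்.ஒருங்குறி:கட்டளை"))
```

Because str.isidentifier() follows the language definition, a check like this would track any future change to what Python allows in identifiers. Note that it also accepts keywords such as class; keyword.iskeyword() could be used to exclude those if desired.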

To ask a more practical question, though, do people actually name their Python functions using Indic diacritics? I live in an English-speaking country, so my understanding of Unicode is largely theoretical. As such, I’ve no real idea whether we’re debating an obscure edge case of the spec, or a problem that causes genuine issues to a significant number of developers.

To me, the regex used in the existing libraries looks like a reasonable (albeit incomplete as we’ve now established) compromise. For practical purposes, it’s been fine for many years. I’m not against improving the spec compliance, but I’d want the discussion to be focused on “does this address actual user problems” rather than being simply about exact conformance to the spec.

MegaIng · April 9, 2025, 12:02am

(Semi relevant side note: I tried to understand why regex decided to add support for Mark to their definition of \w. Sadly, they linked to the wrong unicode annex, making me waste half an hour trying to figure out a document that wasn’t even relevant. What they actually meant instead of tr27 is tr18 which lists a definition of word character that is AFAICT incompatible with their definition of word and their definition of alphanumeric. But it is probably a decent compromise. If python’s re module decided to fully follow these recommendations it would probably resolve this issue)

julienmalardadam · April 9, 2025, 7:04am

Thanks for the confirmation on the standard; that helps a lot! Regarding the other packages, do you have a link to the part of the code with the regex in question? I wasn’t able to find the location in the source code.

julienmalardadam · April 9, 2025, 7:11am

Regarding the usefulness of coding in non-English languages, PEP 3131 made the case when deciding to allow non-English identifiers (PEP 3131 – Supporting Non-ASCII Identifiers | peps.python.org). This article also makes a good point.

julienmalardadam · April 9, 2025, 7:12am

Practically speaking, this issue is preventing the lassi project (https://லஸ்ஸி.இந்தியா) from publishing a command-line tool (லஸ்ஸி · GitHub). Until these issues are resolved, however, it will be difficult to get large numbers of programmers coding in Indian (and other) languages (a bit of a chicken-and-egg problem).