Currently, the unicodedata module supports two versions of the Unicode Character Database – the latest one (17.0.0 for now) and 3.2.0, which is needed for the idna module. The 3.2.0 data is represented as a diff against the latest one, so it only takes a tiny fraction of the size of the complete database. I wonder if there is a use case for supporting more versions. For example, the third-party module wcwidth supports multiple Unicode versions – when you work with a terminal that supports an older Unicode version, you may be interested in using the same version (although support for wide and combining characters is inconsistent across terminals, so you cannot expect much even if you use the correct version).
Should we provide support for all Unicode versions starting from 3.2.0 in the stdlib, or leave it to third-party modules? Since all the infrastructure is already in place, it would be easy to support more versions – the only downsides are the increased size of the binary and the time needed to build the data (which is not needed for every build).
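For context, both databases that ship today are already reachable from Python (the exact version string depends on the CPython build):

```python
import unicodedata

# The database the module was built with:
print(unicodedata.unidata_version)             # e.g. '17.0.0'

# The 3.2.0 snapshot kept for idna/stringprep, exposed as an object with the
# same lookup methods as the module itself:
print(unicodedata.ucd_3_2_0.unidata_version)   # '3.2.0'
print(unicodedata.ucd_3_2_0.category("A"))     # 'Lu'
```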
A semi-related question: should Python support using (potentially newer) system unicode-data, much like it supports using system timezone data? FWICS it isn’t particularly popular, but e.g. gucharmap and ibus do support using it, so perhaps it would make sense to follow suit.
As I understand it, a property of a codepoint has a default value until it’s defined. For example, the General Category property of a codepoint is Unassigned (Cn) until it’s defined, so Unassigned is the default value for General Category.
You could use the Age property to determine the property value of a codepoint for older Unicode versions.
If, say, the age of a codepoint is later than 3.2.0, then its property value at 3.2.0 was the default value.
Doing that would be slower than having multiple versions, but it would take up a lot less space.
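A rough sketch of that idea. The stdlib does not expose the Age property, so the version check below uses a hypothetical age() callable (in practice it would have to come from DerivedAge.txt or a third-party source); the built-in 3.2.0 snapshot is enough to show the default-value behaviour:

```python
import unicodedata

ch = "\U0001F600"  # GRINNING FACE, assigned well after Unicode 3.2.0

# In the current database the code point has a real General_Category...
print(unicodedata.category(ch))            # 'So'
# ...but in the built-in 3.2.0 snapshot it falls back to the default value.
print(unicodedata.ucd_3_2_0.category(ch))  # 'Cn' (Unassigned)

def category_as_of(char, version, age):
    """Approximate General_Category at an older Unicode version.

    `age` is a hypothetical callable returning the Age property as a version
    tuple (the stdlib does not provide one).  This assumes properties of
    already-assigned code points are carried forward unchanged.
    """
    if age(char) > version:
        return "Cn"  # default value for code points not yet assigned
    return unicodedata.category(char)
```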
While there may be special use cases requiring access to earlier versions of the Unicode database, I don’t think they warrant adding more complexity to the implementation.
The idna case, which caused Martin to add support for keeping the 3.2.0 data in the stdlib, is unfortunate in itself. The standard should have had provisions for dealing with newer Unicode versions, so that this would not have been necessary.
So overall, I don’t think it’s worth the effort and having users pay for a feature they will most likely never need. It’s also not clear which versions to support (apart from 3.2.0).
Perhaps one day we’ll get an updated idna standard for internationalized domains which no longer pins the Unicode version. Then we can remove the diff logic again and simplify the implementation.
This is also available via the idna module on PyPI.
Unfortunately, support for the new version is a bit brittle. Many browsers chose to keep IDNA 2003 support around via UTS #46: Unicode IDNA Compatibility Processing and special handling of edge cases. It’ll probably take a few more years before IDNA 2003 is phased out completely.
The current idna module is based on RFC 3490 and RFC 3491, which were obsoleted by RFC 5891; the latter is Unicode version agnostic.
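To illustrate the split, assuming the third-party idna package is installed:

```python
import idna  # PyPI package implementing the RFC 5891 (IDNA 2008) rules

# The stdlib codec follows the older RFC 3490/3491 (IDNA 2003) rules, which
# are pinned to Unicode 3.2.0 and map the sharp s to "ss":
print("straße.de".encode("idna"))  # b'strasse.de'

# IDNA 2008 keeps the sharp s as a valid code point instead:
print(idna.encode("straße.de"))    # b'xn--strae-oqa.de'
```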
I ask because I am going to add several more functions (to support grapheme clusters, line breaks and width calculation) and need to plan how to represent the data. If only 3.2.0 is supported, we can ignore it (nobody will use the new functions with 3.2.0) and focus on the latest version. If we support multiple Unicode versions, we need to be more flexible, since some data does not exist in old versions. This can also affect the optimal representation: for example, there were only 6 code points with Indic_Conjunct_Break=Linker in 15.1.0 and 16.0.0, so a linear search could be appropriate, but in 17.0.0 their number is 20.
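To make the representation question concrete, here is an illustrative comparison of the two lookup strategies (with made-up code points, not the real Indic_Conjunct_Break data):

```python
import bisect

# Illustrative code points only, not real property data.
SMALL_SET = (0x094D, 0x094E, 0x0A4D, 0x0C4D, 0x0C4E, 0x0C4F)

def in_property_linear(cp):
    # Adequate while the set stays at a handful of entries.
    return cp in SMALL_SET

# Once the set grows, sorted (start, end) ranges plus bisect scale better.
RANGE_STARTS = (0x094D, 0x0A4D, 0x0C4D)   # same toy data, as ranges
RANGE_ENDS   = (0x094E, 0x0A4D, 0x0C4F)   # inclusive ends, parallel to starts

def in_property_bisect(cp):
    i = bisect.bisect_right(RANGE_STARTS, cp) - 1
    return i >= 0 and cp <= RANGE_ENDS[i]
```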
There are absolutely use cases for being able to specify a Unicode version to use. Today it is effectively a global setting. What would the API look like to specify otherwise? Is “just” an additional keyword argument on the .encode() and .decode() APIs and the codecs module sufficient to allow code to be explicit? Probably not. And being able to override the process-wide default could lead to messes.
Unicode version upgrades are an occasional challenge when upgrading a Python runtime, as they observably alter how data gets processed. The places that really care are limited, which is why exposing the ability on methods feels good, but the reality of Python code is often that decoding of incoming data and encoding of outgoing data happen far removed from the consuming code that cares.
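Purely as a strawman for the API question – none of these keywords exist today, and plain UTF-8 decoding would not even be affected; the version only matters for codecs and functions that consult the database:

```python
# Hypothetical shapes only -- nothing below exists in CPython today.
host = "straße.example".encode("idna")                           # today: always the 3.2.0 rules
# host = "straße.example".encode("idna", unicode_version="17.0")      # per-call override?
# unicodedata.normalize("NFC", text, unicode_version="15.1")          # per-function override?
```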
Should unicode ^W (edit: I meant str aka PyUnicode) objects carry metadata about the version their binary data was decoded using that matches the representation? That could allow detection and recovery via an API to normalize it against a specific version (often a no-op as most strings are compatible across a wide range of versions).
This gets complicated and isn’t a widespread need, which is why I expect we’ve just never tried to deal with it before. It is likely far easier for a third-party PyPI project to handle correctly, as the problems that would result from bugs in such logic are large.
The versions already being something primarily (only?) exposed via our unicodedata module makes it feel easier, API-wise, to support additional versions. While I like @mgorny’s point about being able to load alternate versions rather than us being responsible for shipping all of them – that would enable people on an older Python to use later versions – it feels complicated.
Internally our use of it is relatively limited:
Old 3.2.0: IDNA’s NFKC normalization (urllib.parse) + encodings.idna + related stringprep.
Modern v17:
traceback uses it for column offset rendering and _pyrepl for similar rendering and editing reasons.
re._parser uses unicodedata.lookup() for escape sequence name lookup. As do unicodeobject.c and codecs.c for \N{...} name lookup/replacement.
Plus the PEG parser for normalization of identifiers (PEP-3131).
All of those modern uses are, I expect, more likely something you’d want to control globally for the process, if possible.
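For example, the name lookups mentioned above are visible from Python both through unicodedata.lookup() and through the \N{...} escapes in string literals and re patterns:

```python
import re
import unicodedata

# Both of these resolve a character name against the current database:
assert unicodedata.lookup("BULLET") == "\N{BULLET}" == "\u2022"

# re supports the same escape (since Python 3.8), backed by unicodedata.lookup():
assert re.fullmatch(r"\N{BULLET}", "\u2022")
```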
… but the challenge with dynamically supporting new versions is that today we compile this in for speed, using the header files generated from the Unicode database by Tools/unicode/makeunicodedata.py.
Supporting a dynamic unicode database feels like a major feature to maintain, is there really demand for that?
Serhiy’s original post was that we could already compile in diffs against additional previous versions if we wanted to. I’m not sure there’s really much demand for that?
For your new functions I’d just start simple and assume the answer is “no” – if we ever decide otherwise, those can be revisited and expanded.
I looked at the unicode-data package in Ubuntu. Is that what you meant? It is just a set of text files downloaded from https://www.unicode.org/ (with zip archives unpacked). They are the same files that are used to generate the binary data (in a more compact and efficient form) for unicodedata. There is no API; you have to read and parse the text files yourself. And there is only one Unicode version – 15.1.0 in the latest non-LTS Ubuntu (Python already includes 17.0.0).
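If one did want to consume those files directly, it amounts to parsing semicolon-separated lines by hand; a minimal sketch (the path is distribution-specific, /usr/share/unicode/ on Debian/Ubuntu, and range records marked <..., First>/<..., Last> are not expanded here):

```python
def general_categories(path="/usr/share/unicode/UnicodeData.txt"):
    """Yield (code point, General_Category) pairs straight from the text file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split(";")
            # Field 0 is the code point in hex, field 1 the name,
            # field 2 the General_Category.
            yield int(fields[0], 16), fields[2]
```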