Allow beta UCD files to be used in the future?

clin1234 · March 21, 2026, 12:42am

The next Unicode beta review period, for Unicode 18, is expected to run from May to July 2026 (see Beta Review Status). According to PEP 790 – Python 3.15 Release Schedule | peps.python.org, no new features are allowed from 3.15 beta 1 onward, so such a feature would have to land in the upcoming 3.16: PEP 826 – Python 3.16 Release Schedule | peps.python.org

This may seem contrived now, but suppose that a future Unicode beta review period overlaps with a future Python version’s alpha releases. I think it would be helpful for Tools/unicode/makeunicodedata.py (cpython/Tools/unicode/makeunicodedata.py at main · python/cpython) to allow pre-release UCD files to be exposed in the unicodedata module through some option, given that by the time of a Unicode beta review period, code points and names are guaranteed to be fixed.

Rosuav · March 21, 2026, 1:15am

What is still able to change, and will any of that cause confusion in Python?

clin1234 · March 21, 2026, 1:02pm

According to the site, only property values and algorithms of the characters are subject to revision, and such changes are very unlikely.

hugovk · April 14, 2026, 6:36am

Date     Python 3.15     Unicode 18.0.0
5 May    b1
26 May                   Beta opens
2 Jun    b2
23 Jun   b3
7 Jul                    Beta closes
18 Jul   b4
4 Aug    RC1
1 Sep    RC2
15 Sep                   Final release
1 Oct    Final release

Their beta review is fully contained in our beta period:

Beta Review

The primary focus for the beta review phase is review of character property data and any algorithmic changes which may impact implementations of the new release. Character repertoire has been established, and can be considered stable at this point. Additions or removals of characters are very unlikely before release. The code points and names of new characters are also no longer subject to revision.

A complete set of data files for the UCD is provided during beta review, as well as complete data for synchronized Unicode Technical Standards (UTS #10, UTS #39, UTS #46, and UTS #51). Implementers should be checking this new data and report on any problems or inconsistencies discovered. All other assets planned for release are also updated for the beta review, including the preliminary code charts and all the annexes. Feedback on those other assets is also welcome during beta review.
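As a quick sanity check of the "fully contained" claim, here is a minimal sketch using only the dates from the table above. It assumes "our beta period" means b1 up to RC1, which is my reading rather than anything stated in PEP 790.

```python
from datetime import date

# Dates copied from the table above (all in 2026).
python_b1 = date(2026, 5, 5)            # Python 3.15 beta 1
python_rc1 = date(2026, 8, 4)           # Python 3.15 release candidate 1
unicode_beta_opens = date(2026, 5, 26)  # Unicode 18.0.0 beta review opens
unicode_beta_closes = date(2026, 7, 7)  # Unicode 18.0.0 beta review closes

# Assumption: the Python beta period is the span from b1 up to RC1.
contained = python_b1 <= unicode_beta_opens and unicode_beta_closes <= python_rc1

# Slack between the end of the Unicode beta review and Python's RC1.
days_before_rc1 = (python_rc1 - unicode_beta_closes).days

print(contained, days_before_rc1)
```

On these dates the Unicode beta review closes four weeks before RC1.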

ncoghlan · April 14, 2026, 7:07am

We benefit from the likes of Fedora accommodating our beta releases within their beta testing periods, so adopting a comparable policy for the standard library with respect to the Unicode database beta period seems reasonable to me.

malemburg · April 14, 2026, 7:51am

How exactly would you see that working?

Note that the Unicode database is not only used for the data in unicodedata, but also for many other string methods, so changes are immediately and widely visible, even if you don’t make use of unicodedata.
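To illustrate the point, here is a small sketch of built-in string behaviour that depends on the generated Unicode tables even though nothing is looked up through unicodedata. The sample characters are just ones I picked; unicodedata is imported only to print the database version.

```python
import unicodedata

# Version of the Unicode database this interpreter was built with.
print(unicodedata.unidata_version)

# None of these calls touch the unicodedata module, yet all of them
# rely on tables generated from the UCD at build time.
is_alpha = "\u00e9".isalpha()        # LATIN SMALL LETTER E WITH ACUTE
digit_value = int("\u0663")          # ARABIC-INDIC DIGIT THREE
upper_sharp_s = "\u00df".upper()     # LATIN SMALL LETTER SHARP S
is_ident = "\u03c0".isidentifier()   # GREEK SMALL LETTER PI

print(is_alpha, digit_value, upper_sharp_s, is_ident)
```

So a change to a property value in a draft UCD could, in principle, change what these return for the affected characters.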

I’m not against the idea of optionally making it available, but it does add quite a bit of maintenance overhead and it’s also not clear how testing would work.

If you’re only interested in the unicodedata module getting updated, this would make things easier, but then you could also use a package on PyPI to be able to access multiple versions of the Unicode database for this purpose.

SnoopJ · April 14, 2026, 5:03pm

I’m a soft +1 on this, especially as I was recently reminded of some changes to makeunicodedata.py that were required in the course of a routine update. It’s nice to get advance notice of that kind of thing. On the other hand, I don’t think there is a big need for this in CPython.

Actually, I think no big changes to Tools/ (and no associated maintenance burden) are required to support this. The UCD files are all named the same way, and the only difference in the unicode.org URLs is the use of draft in place of the version. The build seems to work fine if we change UNIDATA_VERSION = "draft", see below.

On the assumption that this continues to work (decent odds, since unicode.org URLs are pretty predictable), the only change necessary is probably to allow the target UNIDATA_VERSION to be changed at the command line. The bigger concern is what the workflow for testing against Unicode draft versions should be, and how that intersects with the existing workflows that a release manager is juggling (i.e. along the lines of Hugo’s remarks).

[jgerity@giskard ~/repos/cpython (main 2026-04-14 *⏳)]
$ git diff Tools/unicode/  # unicodedata files were also updated by the tool
diff --git a/Tools/unicode/makeunicodedata.py b/Tools/unicode/makeunicodedata.py
index 5db850ca2d1..6a465bfdaf1 100644
--- a/Tools/unicode/makeunicodedata.py
+++ b/Tools/unicode/makeunicodedata.py
@@ -45,7 +45,7 @@
 #   * Doc/library/stdtypes.rst, and
 #   * Doc/library/unicodedata.rst
 #   * Doc/reference/lexical_analysis.rst (three occurrences)
-UNIDATA_VERSION = "17.0.0"
+UNIDATA_VERSION = "draft"
 UNICODE_DATA = "UnicodeData%s.txt"
 COMPOSITION_EXCLUSIONS = "CompositionExclusions%s.txt"
 EASTASIAN_WIDTH = "EastAsianWidth%s.txt"
[jgerity@giskard ~/repos/cpython (main 2026-04-14 *⏳)]
$ ./configure && make -j10
... snip ...
[jgerity@giskard ~/repos/cpython (main 2026-04-14 *⏳)]
$ ./python -c "import sys, unicodedata; print(sys.version); print(unicodedata.name(chr(0x16D98)))"  # randomly selected codepoint new in 18.0.0
3.15.0a8+ (heads/main-dirty:3cb7eaec857, Apr 14 2026, 12:07:40) [GCC 13.3.0]
CHISOI SIGN ANUSVARA
[jgerity@giskard ~/repos/cpython (main 2026-04-14 *⏳)]
$ python3.14 -c "import sys, unicodedata; print(sys.version); print(unicodedata.name(chr(0x16D98)))"
3.14.2 (main, Dec  9 2025, 15:26:04) [GCC 12.3.0]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import sys, unicodedata; print(sys.version); print(unicodedata.name(chr(0x16D98)))
                                                       ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
ValueError: no such name
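For the command-line override mentioned above, a minimal standalone sketch might look like the following. To be clear about what is made up: the --unidata-version flag and the parse_args and ucd_url helpers are hypothetical and do not exist in makeunicodedata.py today, and the URL layout is my assumption based on "draft in place of the version".

```python
import argparse

# Mirrors the pinned value shown in the diff above.
DEFAULT_UNIDATA_VERSION = "17.0.0"


def parse_args(argv=None):
    """Parse a hypothetical --unidata-version option."""
    parser = argparse.ArgumentParser(
        description="Sketch: choose which UCD version to fetch."
    )
    parser.add_argument(
        "--unidata-version",
        default=DEFAULT_UNIDATA_VERSION,
        help='UCD version to fetch, e.g. "17.0.0" or "draft"',
    )
    return parser.parse_args(argv)


def ucd_url(version, template="UnicodeData%s.txt"):
    """Build a download URL for one UCD file.

    Assumption: the same path layout applies to released versions and
    to "draft"; the real tool may build its URLs differently.
    """
    filename = template % ""  # current-version files carry no suffix
    return f"https://www.unicode.org/Public/{version}/ucd/{filename}"


args = parse_args(["--unidata-version", "draft"])
print(ucd_url(args.unidata_version))
```

Keeping the released version as the default means a plain run behaves as before, and the draft data is strictly opt-in.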

I’ll plug the unicodedata2 project here, which is patterned directly on unicodedata, has decent backwards compatibility, and usually updates pretty promptly. It is not built with multiple versions in mind, and has the same narrow perspective on the UCD that the stdlib module does, but when you want “unicodedata but with a not-out-of-date UCD” it works great.
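A minimal sketch of how I tend to use it, assuming unicodedata2 may or may not be installed (it is a third-party package from PyPI); the ucd alias is just a local name:

```python
try:
    # Third-party backport, usually tracking a newer UCD than the stdlib.
    import unicodedata2 as ucd
except ImportError:
    # Fall back to whatever UCD version the interpreter was built with.
    import unicodedata as ucd

# Both modules expose the same basic interface.
print(ucd.unidata_version)
name_of_a = ucd.name("A")
print(name_of_a)
```

Only lookups made through ucd see the newer data this way; the built-in str methods still use the tables compiled into the interpreter.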

There are some other projects on PyPI that aim for a more complete picture of the UCD, but none of the ones I’ve looked at seem to be particularly active.