Text segmentation API design

storchaka · December 21, 2025, 10:46am

I am currently working on improvementation of the Unicode support. More and more parts of Python support a rich text user interface – they need to know the width of the text in columns, they need textwrap supporting the colored text and text containing modifiers, wide characters, emoji. This is a complex problem, and the algorithmsa are language depending, so previously the answer was “use third-party libraries”. But now we need this for our own needs, we cannot make the stdlib depending on third-party libraries. Note that case transformation is also language depended, but str.upper() and str.lower() ar the methods of the builtin class, They implement the part that is language independed.

There are alsready several implementation for rough estimation of the text width in the stdlib (in traceback, _pyrepl). It turns out, there is no standard algorithm for it, but we most likely need to break a text on grapheme clusters to handle many corner cases (for example, the skin tone modifier has width 2 if alone, but has width 0 if applied to an appropriate emoji). So we needs to implement grapheme clust break algorithm.

And we have a question about API. An old proposition proposed to add a bunch of functions, like next_<indextype>(u, index) and prev_<indextype>(u, index). This is not paticularly convenien and not efficient, because for each index you will need to look at least one character before and after it, and sometimes you will need to step back to the safe point and start iterating until you achieve the requested index. They cannot be made O(1). More recent proposition proposed to use iterators which can cache an internal state. The question is what should emit the iterator? It can emit integers corresponding to the breaking points. It can emit substrings between neighboring breaking points (e.g. separate graphemes). It can emit slice objects. It can emit rich objects with attributes or methods returning the start and the end indices, the original string, the substring (something like re.Match or UnicodeError). We can have several iterators returning different types. Then what should be their names, and what should be the API of the rich object if we use it?

dg-pb · December 21, 2025, 1:32pm

It would be nice if it was part of re if:

If cost of Match construction is acceptable for applications in mind.
The places this is needed are ok to import re.

regex library has \X for default unicode rules.

Would get API of finditer/findall for free.
re would get extended functionality at the same time.

For language specificity could do something like:

re.show_segmenters()
re.add_segmenter('th', <SPEC>)
re.compile(r'\X{th}')

Various libraries could provide segmenter endpoints:

import emojis
re.add_segmenter('em', emojis.SEGMENTER)
re.compile(r'\X{em}')

storchaka · December 26, 2025, 9:36am

I implemented it as an iterator emitting rich objects – the start and end attributes give the indices of the start and the end of the grapheme cluster, applying str() returns the corresponding substring. If there are any suggestions, you are welcome.

github.com/python/cpython

gh-74902: Add Unicode Grapheme Cluster Break algorithm

main ← serhiy-storchaka:grapheme_cluster_break2

відкрито 04:34PM - 22 Dec 25 UTC

serhiy-storchaka

+4335 -3048

Add the unicodedata.iter_graphemes() function to iterate over grapheme clusters …according to rules defined in Unicode Standard Annex `#29`. Add unicodedata.grapheme_cluster_break(), unicodedata.indic_conjunct_break() and unicodedata.extended_pictographic() functions to get the properties of the character which are related to the above algorithm. * Issue: gh-74902

Maybe instead of iter_graphemes() returning an iterator, we could add graphemes() returning an iterable. Later we could implement support for reverse iteration, although this is a more difficult task, because all rules depend on a single following character, but some rules depend on multiple preceeding characters, so reverse iteration would need to use a lookahead of unspecified length.