Text segmentation API design

I am currently working on improvementation of the Unicode support. More and more parts of Python support a rich text user interface – they need to know the width of the text in columns, they need textwrap supporting the colored text and text containing modifiers, wide characters, emoji. This is a complex problem, and the algorithmsa are language depending, so previously the answer was “use third-party libraries”. But now we need this for our own needs, we cannot make the stdlib depending on third-party libraries. Note that case transformation is also language depended, but str.upper() and str.lower() ar the methods of the builtin class, They implement the part that is language independed.

There are alsready several implementation for rough estimation of the text width in the stdlib (in traceback, _pyrepl). It turns out, there is no standard algorithm for it, but we most likely need to break a text on grapheme clusters to handle many corner cases (for example, the skin tone modifier has width 2 if alone, but has width 0 if applied to an appropriate emoji). So we needs to implement grapheme clust break algorithm.

And we have a question about API. An old proposition proposed to add a bunch of functions, like next_<indextype>(u, index) and prev_<indextype>(u, index). This is not paticularly convenien and not efficient, because for each index you will need to look at least one character before and after it, and sometimes you will need to step back to the safe point and start iterating until you achieve the requested index. They cannot be made O(1). More recent proposition proposed to use iterators which can cache an internal state. The question is what should emit the iterator? It can emit integers corresponding to the breaking points. It can emit substrings between neighboring breaking points (e.g. separate graphemes). It can emit slice objects. It can emit rich objects with attributes or methods returning the start and the end indices, the original string, the substring (something like re.Match or UnicodeError). We can have several iterators returning different types. Then what should be their names, and what should be the API of the rich object if we use it?

10 Likes

It would be nice if it was part of re if:

  1. If cost of Match construction is acceptable for applications in mind.
  2. The places this is needed are ok to import re.

regex library has \X for default unicode rules.

Would get API of finditer/findall for free.
re would get extended functionality at the same time.

For language specificity could do something like:

re.show_segmenters()
re.add_segmenter('th', <SPEC>)
re.compile(r'\X{th}')

Various libraries could provide segmenter endpoints:

import emojis
re.add_segmenter('em', emojis.SEGMENTER)
re.compile(r'\X{em}')

I implemented it as an iterator emitting rich objects – the start and end attributes give the indices of the start and the end of the grapheme cluster, applying str() returns the corresponding substring. If there are any suggestions, you are welcome.

Maybe instead of iter_graphemes() returning an iterator, we could add graphemes() returning an iterable. Later we could implement support for reverse iteration, although this is a more difficult task, because all rules depend on a single following character, but some rules depend on multiple preceeding characters, so reverse iteration would need to use a lookahead of unspecified length.

3 Likes

Now that PR 143076 is merged, would it help if I created an issue listing all the places in the stdlib that could benefit from this? On top of my head this includes all the places that assume “string length is string width”, like textwrap and some string formatting methods.

Great and this is something I’ve been waiting for. a while. I’m curious if the API will be easily adaptable, in the unlikely case in the future we want to implement the other unicode segmentation algorithms (word break and line break) as well?[1] I don’t think we will necessarily need the other algorithms for our intended use in Python (not for now at least), but seeing that they’re all rule-based algorithms I assume that the same implementation strategy can go well with them.


  1. A while ago I’ve the foolishness to implement something similar in pure Python using the UCD data files, it kind of works (and even passes the official tests :smiling_face_with_tear:) but performance-wise it’ll never beat something written in C. Unfortunately I can’t really use ICU4C which has a habit of broken builds on my machine. ↩︎

As I think this relates to this discussion, the wcwidth library was released in version 0.3 that includes changes covering all the use cases I was thinking of when I raised this issue:

  • correct wrapping using a new wcwidth.wrap
  • correct width using a new wcwidth.width (including ansi escape codes handling)
  • same with center/justification of text