I am currently working on improving Unicode support. More and more parts of Python provide a rich text user interface: they need to know the width of text in columns, and they need textwrap to support colored text and text containing modifiers, wide characters, and emoji. This is a complex problem, and the algorithms are language-dependent, so previously the answer was “use third-party libraries”. But now we need this for our own purposes, and we cannot make the stdlib depend on third-party libraries. Note that case transformation is also language-dependent, but str.upper() and str.lower() are methods of the builtin class; they implement the part that is language-independent.
There are already several implementations of rough text width estimation in the stdlib (in traceback, _pyrepl). It turns out there is no standard algorithm for it, but we most likely need to break the text into grapheme clusters to handle many corner cases (for example, the skin tone modifier has width 2 when it stands alone, but width 0 when it is applied to an appropriate emoji). So we need to implement the grapheme cluster break algorithm.
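To illustrate why a per-code-point estimate is not enough, here is a minimal sketch of such an estimate. It is not the code from traceback or _pyrepl; `rough_width` is a hypothetical helper, and the rule "wide or fullwidth counts as 2, marks and format characters count as 0" is a simplifying assumption.

```python
import unicodedata


def rough_width(text: str) -> int:
    """Estimate the column width of text one code point at a time.

    Hypothetical helper for illustration only; it ignores grapheme clusters.
    """
    width = 0
    for ch in text:
        # Assumption: nonspacing/enclosing marks and format characters
        # (e.g. ZERO WIDTH JOINER) occupy no column of their own.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # Assumption: East Asian Wide and Fullwidth characters take 2 columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


print(rough_width("abc"))        # plain ASCII: one column per character
print(rough_width("e\u0301"))    # "e" + COMBINING ACUTE ACCENT: the mark adds nothing
# WAVING HAND SIGN + skin tone modifier: both code points are Wide, so they
# are counted separately, although the pair is rendered as a single emoji.
print(rough_width("\U0001F44B\U0001F3FB"))
```

The last case is the kind of corner case mentioned above: without knowing that the modifier belongs to the same cluster as the preceding emoji, the estimate counts it twice.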
That brings us to a question about the API. An old proposal suggested adding a bunch of functions, like next_<indextype>(u, index) and prev_<indextype>(u, index). This is not particularly convenient and not efficient, because for each index you need to look at at least one character before and after it, and sometimes you need to step back to a safe point and iterate forward until you reach the requested index. These functions cannot be made O(1). A more recent proposal suggested using iterators, which can cache internal state. The question is what the iterator should emit. It can emit integers corresponding to the break points. It can emit the substrings between neighboring break points (e.g. separate graphemes). It can emit slice objects. It can emit rich objects with attributes or methods returning the start and end indices, the original string, and the substring (something like re.Match or UnicodeError). We can also have several iterators returning different types. Then what should their names be, and what should the API of the rich object be if we use it?
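To make the options easier to compare, here is a rough sketch of the four output shapes. All names (`iter_break_points`, `iter_slices`, `iter_substrings`, `iter_segments`, `Segment`) are hypothetical, not a proposed API. The break rule is a deliberately crude stand-in (a character of a Mark category extends the previous cluster), not the real grapheme cluster break algorithm.

```python
import unicodedata
from dataclasses import dataclass
from typing import Iterator


def _extends(ch: str) -> bool:
    # Crude stand-in for the real break rules: combining marks (Mn, Mc, Me)
    # are attached to the preceding character.
    return unicodedata.category(ch).startswith("M")


def iter_break_points(s: str) -> Iterator[int]:
    """Option 1: emit the end index of each cluster."""
    for i in range(1, len(s)):
        if not _extends(s[i]):
            yield i
    if s:
        yield len(s)


def iter_slices(s: str) -> Iterator[slice]:
    """Option 3: emit slice objects."""
    start = 0
    for end in iter_break_points(s):
        yield slice(start, end)
        start = end


def iter_substrings(s: str) -> Iterator[str]:
    """Option 2: emit the substrings between neighboring break points."""
    for sl in iter_slices(s):
        yield s[sl]


@dataclass(frozen=True)
class Segment:
    """Option 4: a rich object, loosely modeled on re.Match."""
    string: str
    start: int
    end: int

    def __str__(self) -> str:
        return self.string[self.start:self.end]


def iter_segments(s: str) -> Iterator[Segment]:
    for sl in iter_slices(s):
        yield Segment(s, sl.start, sl.stop)


text = "e\u0301a"  # "e" + COMBINING ACUTE ACCENT, then "a"
print(list(iter_break_points(text)))
print(list(iter_slices(text)))
print([str(seg) for seg in iter_segments(text)])
```

In this sketch the integer iterator is the primitive and the others are thin wrappers around it, which suggests that the choice is mostly about convenience for the caller rather than about the cost of the algorithm.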