Get changed offsets from Unicode normalization?

The function unicodedata.normalize(form, unistr) is useful for converting certain sequences of Unicode code points into different, normalized sequences. This is also a very important step in natural language processing (NLP).

However, in NLP we often have data structures that refer to parts of a string (e.g. to words within the string) by their start and end offsets (so-called “stand-off annotation”). Such offsets are no longer valid after normalizing the text.

Therefore it would be extremely useful to have a variant of the unicodedata.normalize(form, unistr) function that also returns the offset changes caused by the normalization, e.g. a mapping from each old offset to the corresponding new offset.
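For illustration, such an API might look like this (normalize_with_offsets is a made-up name, not an existing function):

normalized, offsets = normalize_with_offsets("NFC", "cafe\u0301 au lait")
# normalized == "café au lait"
# offsets[5] == 4: old offset 5 (the end of "café", written with a
# combining accent) maps to new offset 4, since NFC turns "e" plus the
# combining accent into the single precomposed code point "é"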

Is there a known way to achieve this?

How would one have to go about creating a feature request for this to get added to Python?

Hi Johann,

Can you give an example of how this would work, showing input, output, and what you would do with the output?

newoffset = len(unicodedata.normalize(form, unistr[:offset]))
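A runnable sketch of that prefix trick (remap_offset is just an illustrative name; it assumes the offset never falls inside a combining sequence, where the normalization of a prefix can differ from the prefix of the normalization):

import unicodedata

def remap_offset(form, unistr, offset):
    # The length of the normalized prefix is the new offset.
    return len(unicodedata.normalize(form, unistr[:offset]))

s = "cafe\u0301 au lait"   # "café" written with a combining accent
norm = unicodedata.normalize("NFC", s)
print(norm[remap_offset("NFC", s, 0):remap_offset("NFC", s, 5)])   # -> "café"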

Hi Johann,

It is usually best to normalize Unicode text before letting it enter any processing, including indexing it.

If you really need a mapping to the original text, it’s possible to rewrite the implementation to maintain such an index and put this on PyPI.

You can find the implementation in CPython’s source, in Modules/unicodedata.c.

I don’t think this is a feature which is requested often enough to get added to Python’s stdlib.


Yet another approach would be to use difflib to calculate the changes.
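That could look roughly like this (a sketch, not a tested implementation; offsets inside spans that difflib reports as changed are snapped to the start of the corresponding new span):

import difflib
import unicodedata

def offset_map(form, unistr):
    # Build an old-offset -> new-offset mapping from difflib's opcodes.
    norm = unicodedata.normalize(form, unistr)
    sm = difflib.SequenceMatcher(None, unistr, norm, autojunk=False)
    mapping = {}
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        for k in range(i1, i2):
            # Offsets in an "equal" block shift uniformly; elsewhere we
            # can only approximate with the start of the new span.
            mapping[k] = j1 + (k - i1) if tag == "equal" else j1
    mapping[len(unistr)] = len(norm)   # also map the end-of-string offset
    return norm, mapping

norm, m = offset_map("NFC", "cafe\u0301 au lait")
print(norm[m[0]:m[5]])   # -> "café"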

Thank you everyone for those answers!
@storchaka this would be very inefficient for a large number of offsets, since the normalize function would be recomputed again and again for slightly different prefixes.
@malemburg thanks, I already had a look at that code and was considering that as a last-resort option, but wanted to see what the opinion on usefulness is. I think it is a pity not to have this in the original code: a copy would need to be kept in sync with whatever changes are made to the original, or it would bit-rot over time. I have not thought this through enough to know whether a single function could serve both situations with almost no overhead when no offset mapping is requested, but my feeling is that this should not be too hard.
Guido (apparently I can only mention 2 users in a post): again, this would be rather inefficient compared with just keeping track of the offsets during normalization, even if it were guaranteed to always work. And I think not even that can be guaranteed in all situations, as difflib uses the Ratcliff/Obershelp algorithm for finding diffs, which may e.g. produce diffs where several changes are merged into one replacement sequence (the diffs are not guaranteed to be minimal edits).