Get changed offsets from Unicode normalization?

The function unicodedata.normalize(form, unistr) is useful for converting certain sequences of Unicode code points into different, normalized sequences. This is also a very important step in natural language processing (NLP).

However, in NLP we often have data structures that refer to parts of a string (e.g. to words within the string) by their start and end offsets (so-called “stand-off annotation”). Such offsets are no longer valid after normalizing the text.

Therefore it would be extremely useful to have a variant of the unicodedata.normalize(form, unistr) function that also returns the offset changes caused by the normalization, e.g. a mapping from each old offset to the corresponding new offset.
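For illustration, such an API might look like this (normalize_with_offsets is a made-up name, not an existing function):

normalized, offsets = normalize_with_offsets("NFC", "cafe\u0301 au lait")
# normalized == "café au lait"
# offsets[5] == 4: old offset 5 (the end of "café", written with a
# combining accent) maps to new offset 4, since NFC turns "e" plus the
# combining accent into the single precomposed code point "é"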

Is there a known way to achieve this?

How would one have to go about creating a feature request for this to get added to Python?

Hi Johann,

Can you give an example of how this would work, showing input, output, and what you would do with the output?

newoffset = len(unicodedata.normalize(form, unistr[:offset]))
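A runnable sketch of that prefix trick (remap_offset is just an illustrative name; it assumes the offset never falls inside a combining sequence, where the normalization of a prefix can differ from the prefix of the normalization):

import unicodedata

def remap_offset(form, unistr, offset):
    # The length of the normalized prefix is the new offset.
    return len(unicodedata.normalize(form, unistr[:offset]))

s = "cafe\u0301 au lait"   # "café" written with a combining accent
norm = unicodedata.normalize("NFC", s)
print(norm[remap_offset("NFC", s, 0):remap_offset("NFC", s, 5)])   # -> "café"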

Hi Johann,

It is usually best to normalize Unicode text before letting it enter any processing, including indexing it.

If you really need a mapping to the original text, it’s possible to rewrite the implementation to maintain such an index and put this on PyPI.

You can find the implementation in CPython’s source, in Modules/unicodedata.c.

I don’t think this is a feature which is requested often enough to get added to Python’s stdlib.


Yet another approach would be to use difflib to calculate the changes.
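That could look roughly like this (a sketch, not a tested implementation; offsets inside spans that difflib reports as changed are snapped to the start of the corresponding new span):

import difflib
import unicodedata

def offset_map(form, unistr):
    # Build an old-offset -> new-offset mapping from difflib's opcodes.
    norm = unicodedata.normalize(form, unistr)
    sm = difflib.SequenceMatcher(None, unistr, norm, autojunk=False)
    mapping = {}
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        for k in range(i1, i2):
            # Offsets in an "equal" block shift uniformly; elsewhere we
            # can only approximate with the start of the new span.
            mapping[k] = j1 + (k - i1) if tag == "equal" else j1
    mapping[len(unistr)] = len(norm)   # also map the end-of-string offset
    return norm, mapping

norm, m = offset_map("NFC", "cafe\u0301 au lait")
print(norm[m[0]:m[5]])   # -> "café"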

Thank you everyone for those answers!
@storchaka this would be very inefficient for a large number of offsets, since the normalize function would be recomputed again and again for slightly different prefixes.
@malemburg thanks, I already had a look at that code and was considering that as a last-resort option, but wanted to see what the opinion on usefulness is. I think it is a pity not to have this in the original code: a copy would need to be kept in sync with whatever changes are made to the original, or it would bit-rot over time. I have not thought this through enough to know whether a single function could serve both situations with almost no overhead when no offset mapping is requested, but my feeling is that this should not be too hard.
Guido (apparently I can only mention 2 users in a post): again, this would be rather inefficient compared with just keeping track of the offsets during normalization, even if it were guaranteed to always work. And I think not even that can be guaranteed in all situations, as difflib uses the Ratcliff/Obershelp algorithm for finding diffs, which may e.g. produce diffs where several changes are merged into one replacement sequence (the diffs are not guaranteed to be minimal edits).