What to do with re.LOCALE?

storchaka · August 30, 2023, 11:39am

The re.LOCALE flag and the (?L) mode modifier determine what is a letter and what characters are equal in case-insensitive mode in a byte pattern.

It only works with byte strings, not Unicode strings.
It only works in 8-bit locales, not UTF-8 or Shift-JIS. Today 8-bit locales are very rare.
It is an order of magnitude slower. It does not allow compile-time optimization, and at run time calling tolower() is much slower than a simple table lookup.
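For readers who have not used the flag, here is a minimal sketch of the first restriction, assuming Python 3.6 or later. The variable names are only illustrative.

```python
import re

# A bytes pattern accepts re.LOCALE (the inline form is (?L)).
locale_word = re.compile(rb"\w+", re.LOCALE)
has_locale_flag = bool(locale_word.flags & re.LOCALE)

# ASCII letters count as word characters whatever the current locale is.
ascii_match = locale_word.fullmatch(b"abc")

# A str pattern rejects the flag with ValueError.
try:
    re.compile(r"\w+", re.LOCALE)
except ValueError as exc:
    str_pattern_error = exc
else:
    str_pattern_error = None
```

What `\w` matches beyond ASCII in the bytes case depends on the current LC_CTYPE locale, so the sketch does not assert anything about non-ASCII bytes.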

In the past there were issues with compiling a pattern in one locale and using it in another, and with caching, but they were fixed a long time ago.

In 2014 I implemented support for re.LOCALE with Unicode strings, but the proposal did not get support.

github.com/python/cpython

re.LOCALE is nonsensical for Unicode

opened 03:43PM - 14 Sep 14 UTC

closed 09:53AM - 01 Dec 14 UTC

serhiy-storchaka

type-feature stdlib extension-modules topic-regex topic-unicode

BPO: [22407](https://bugs.python.org/issue22407) · Resolution: fixed · Versions: Python 3.5
Files: [re_unicode_locale.patch](https://bugs.python.org/file36615/re_unicode_locale.patch), [re_deprecate_unicode_locale.patch](https://bugs.python.org/file36853/re_deprecate_unicode_locale.patch)

The only visible effect of using re.LOCALE with a Unicode string (besides the slowdown) was different handling of the letters I, İ, ı and i in the Turkish locale. Turkish is unique among languages in this respect. There may have been other differences from normal Unicode matching, but I did not notice them.
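To show what "normal Unicode matching" does with these four letters, here is a small sketch. It only shows the locale-independent behaviour of str and re in current Python 3, not the old re.LOCALE behaviour in a Turkish locale. The constant names are illustrative.

```python
import re

DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı

# str case mappings are language-independent: no Turkish special-casing.
lower_of_capital_i = "I".lower()               # plain 'i', not dotless 'ı'
upper_of_dotless_i = DOTLESS_SMALL_I.upper()   # plain 'I'
lower_of_dotted_i = DOTTED_CAPITAL_I.lower()   # 'i' followed by U+0307

# The re docs note that [a-z] with IGNORECASE on a str pattern also
# matches İ (U+0130) and ı (U+0131).
matches_dotted = re.fullmatch("[a-z]", DOTTED_CAPITAL_I, re.IGNORECASE) is not None
matches_dotless = re.fullmatch("[a-z]", DOTLESS_SMALL_I, re.IGNORECASE) is not None
```

So in Unicode mode all four letters end up in one case-insensitive class, while Turkish casing rules pair I with ı and İ with i.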

I still see re.LOCALE used in code in the wild. In all cases it was unnecessary.

What should we do with re.LOCALE?

Deprecate and remove it.
Implement Unicode string support.
Leave it as it was.


malemburg · August 30, 2023, 1:11pm

re.LOCALE made sense when we were still using the bytes version of str(). In the Unicode world, locale support is either no longer necessary or you need to get into the really complex ICU world (but that’s outside the scope of what we can support in the stdlib)…
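As a rough illustration of the "no longer necessary" case: if the bytes are known to be in some 8-bit encoding (Latin-1 is assumed here purely as an example), decoding them and using a str pattern gives letter and case handling without any locale setup.

```python
import re

raw = b"caf\xe9"  # "café" encoded as Latin-1 (assumed encoding)

# Without re.LOCALE, a bytes pattern treats only ASCII as word characters,
# so the trailing 0xE9 byte is not matched by \w.
bytes_match = re.fullmatch(rb"\w+", raw)

# After decoding, default Unicode matching sees é as a letter.
text = raw.decode("latin-1")
str_match = re.fullmatch(r"\w+", text)

# Case-insensitive matching also works on the decoded text.
case_match = re.fullmatch("CAF\u00c9", text, re.IGNORECASE)
```

This is not a drop-in replacement in general, since it assumes the encoding is known up front rather than taken from the current locale.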

hugovk · August 30, 2023, 2:40pm

Not much use in the top 5k projects:

$ python3 ~/github/misc/cpython/search_pypi_top.py -q . "re\.LOCALE"
./drf-extensions-0.7.1.tar.gz: drf-extensions-0.7.1/docs/backdoc.py: "l": re.LOCALE,
./sphinx_toolbox-3.4.0.tar.gz: sphinx_toolbox-3.4.0/sphinx_toolbox/more_autodoc/regex.py: if flags & re.LOCALE:
./schema-0.7.5.tar.gz: schema-0.7.5/schema.py: "re.LOCALE",
./behave-1.2.6.tar.gz: behave-1.2.6/behave/configuration.py: # -- NOTE: re.LOCALE is removed in Python 3.6 (deprecated in Python 3.5)
./behave-1.2.6.tar.gz: behave-1.2.6/behave/configuration.py: # flags = (re.UNICODE | re.LOCALE)
./djlint-1.31.1.tar.gz: djlint-1.31.1/src/djlint/lint.py: "re.LOCALE": re.LOCALE,
./pockets-0.9.1.tar.gz: pockets-0.9.1/CHANGES: * Fixes Python 3.6 compatibility by only using re.LOCALE flag on Python 2
./pockets-0.9.1.tar.gz: pockets-0.9.1/pockets/string.py: to (re.LOCALE | re.MULTILINE | re.UNICODE).
./markdown2-2.4.9.tar.gz: markdown2-2.4.9/lib/markdown2.py: "l": re.LOCALE,
./pymongo-4.4.0.tar.gz: pymongo-4.4.0/bson/__init__.py: if flags & re.LOCALE:
./pymongo-4.4.0.tar.gz: pymongo-4.4.0/bson/json_util.py: if obj.flags & re.LOCALE:
./pymongo-4.4.0.tar.gz: pymongo-4.4.0/bson/regex.py: flags |= re.LOCALE
./pyre2-0.3.6.tar.gz: pyre2-0.3.6/src/compile.pxi: return fallback(original_pattern, flags, "re.LOCALE not supported")
./pyre2-0.3.6.tar.gz: pyre2-0.3.6/src/re2.pyx: LOCALE = re.LOCALE

Time: 0:00:18.548066
Found 14 matching lines in 9 projects

AA-Turner · August 30, 2023, 4:27pm

You’d also need to look for re.L, (?L), and (?L:) – but the inline flags can contain multiple modifiers ^[1] (I’m not going to attempt to make a pattern to find them!)

A

[1] (?aiLmsux) and (?aiLmsux-imsx:...); docs
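For what it is worth, a rough sketch of such a pattern might look like the one below. The names `INLINE_LOCALE_FLAG` and `uses_inline_locale_flag` are made up for this example, and the regex is a heuristic over source text, so it can produce false positives (for instance in ordinary strings that only look like a flag group).

```python
import re

# "(?" + any flag letters that include an "L" + optional "-flags" part,
# ended by ")" (global form) or ":" (scoped form).
INLINE_LOCALE_FLAG = re.compile(r"\(\?[aimsux]*L[aiLmsux]*(?:-[imsx]+)?[:)]")


def uses_inline_locale_flag(pattern_source: str) -> bool:
    """Return True if the text appears to contain an inline L modifier."""
    return INLINE_LOCALE_FLAG.search(pattern_source) is not None


samples = {
    r"(?L)\w+": True,        # plain inline flag
    r"(?iL)abc": True,       # combined with another flag
    r"(?Lm-i:abc)": True,    # scoped form with a removed flag
    r"(?i)abc": False,       # no L modifier
    r"(?P<name>L)": False,   # named group, not a flag group
    r"(?:Label)": False,     # non-capturing group
}
results = {source: uses_inline_locale_flag(source) for source in samples}
```

It would still need to be run over the unpacked sdists the same way as the other searches.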

hugovk · August 30, 2023, 4:31pm

$ python3 ~/github/misc/cpython/search_pypi_top.py -q . "\bre\.L\b"
./apprise-1.4.0.tar.gz: apprise-1.4.0/apprise/utils.py: 'L': re.L,
./djlint-1.31.1.tar.gz: djlint-1.31.1/src/djlint/lint.py: "re.L": re.L,
./parsimonious-0.10.0.tar.gz: parsimonious-0.10.0/parsimonious/expressions.py: (locale and re.L) |
./pockets-0.9.1.tar.gz: pockets-0.9.1/pockets/string.py: RE_FLAGS = re.L | re.M | re.U
./pymongo-4.4.0.tar.gz: pymongo-4.4.0/bson/json_util.py: "l": re.L,
./pymongo-4.4.0.tar.gz: pymongo-4.4.0/test/test_bson.py: regex = re.compile(b"", re.I | re.L | re.M | re.S | re.X)
./pymongo-4.4.0.tar.gz: pymongo-4.4.0/test/test_bson.py: self.assertEqual(re.I | re.L | re.M | re.S | re.X, Regex.from_native(regex).flags)
./pyre2-0.3.6.tar.gz: pyre2-0.3.6/src/re2.pyx: I, M, S, U, X, L = re.I, re.M, re.S, re.U, re.X, re.L

Time: 0:00:41.166755
Found 8 matching lines in 6 projects

I expect the others are a similar order of magnitude.

storchaka · September 18, 2023, 2:15pm

The majority of votes are in favor of deprecation.

vstinner · November 30, 2023, 7:53pm

During the deprecation period, if a wave of users asks to keep the feature, we can keep it. With more concrete use cases, we can better understand how it is used.
