storchaka
(Serhiy Storchaka)
August 30, 2023, 11:39am
1
The re.LOCALE flag and the (?L) mode modifier determine what is a letter and what characters are equal in case-insensitive mode in a byte pattern.
It only works with byte strings, not Unicode strings.
It only works in 8-bit locales, not UTF-8 or Shift-JIS. Today 8-bit locales are very rare.
It is an order of magnitude slower. It does not allow compile-time optimization, and calling tolower() at runtime is much slower than a simple table lookup.
In the past there were issues with compiling a pattern in one locale and using it in another, and with caching, but they were fixed a long time ago.
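A minimal sketch of the bytes-only restriction described above, assuming Python 3.6 or later, where combining re.LOCALE with a str pattern raises ValueError. The helper name locale_flag_allowed is hypothetical and exists only for illustration.

```python
import re


def locale_flag_allowed(pattern):
    """Return True if `pattern` can be compiled with re.LOCALE."""
    try:
        re.compile(pattern, re.LOCALE)
    except ValueError:
        # Raised for str patterns: LOCALE is only accepted for bytes patterns.
        return False
    return True


bytes_ok = locale_flag_allowed(rb"\w+")  # bytes pattern
str_ok = locale_flag_allowed(r"\w+")     # str (Unicode) pattern
```

Here bytes_ok is expected to be true and str_ok false; whether the bytes pattern actually behaves differently under re.LOCALE still depends on the current locale being an 8-bit one.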
In 2014 I implemented support for re.LOCALE with Unicode strings, but the proposal was not supported.
opened 03:43PM - 14 Sep 14 UTC
closed 09:53AM - 01 Dec 14 UTC
type-feature
stdlib
extension-modules
topic-regex
topic-unicode
BPO | [22407](https://bugs.python.org/issue22407)
--- | :---
Nosy | @pitrou, @vs… tinner, @ezio-melotti, @vadmium, @serhiy-storchaka
Dependencies | <li>bpo-22838: Convert re tests to unittest</li>
Files | <li>[re_unicode_locale.patch](https://bugs.python.org/file36615/re_unicode_locale.patch "Uploaded as text/plain at 2014-09-14.15:43:15 by @serhiy-storchaka")</li><li>[re_deprecate_unicode_locale.patch](https://bugs.python.org/file36853/re_deprecate_unicode_locale.patch "Uploaded as text/plain at 2014-10-09.15:10:20 by @serhiy-storchaka")</li>
<sup>*Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.*</sup>
<details><summary>Show more details</summary><p>
GitHub fields:
```python
assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2014-12-01.09:53:48.617>
created_at = <Date 2014-09-14.15:43:18.738>
labels = ['extension-modules', 'expert-regex', 'type-feature', 'library', 'expert-unicode']
title = 're.LOCALE is nonsensical for Unicode'
updated_at = <Date 2014-12-01.11:16:44.679>
user = 'https://github.com/serhiy-storchaka'
```
bugs.python.org fields:
```python
activity = <Date 2014-12-01.11:16:44.679>
actor = 'python-dev'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2014-12-01.09:53:48.617>
closer = 'serhiy.storchaka'
components = ['Extension Modules', 'Library (Lib)', 'Regular Expressions', 'Unicode']
creation = <Date 2014-09-14.15:43:18.738>
creator = 'serhiy.storchaka'
dependencies = ['22838']
files = ['36615', '36853']
hgrepos = []
issue_num = 22407
keywords = ['patch']
message_count = 9.0
messages = ['226871', '226949', '226959', '226960', '228876', '231022', '231924', '231927', '231931']
nosy_count = 8.0
nosy_names = ['pitrou', 'vstinner', 'ezio.melotti', 'mrabarnett', 'Arfrever', 'python-dev', 'martin.panter', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue22407'
versions = ['Python 3.5']
```
</p></details>
The only visible effect of using re.LOCALE with Unicode strings (besides the slowdown) was different handling of the letters I, İ, ı and i in the Turkish locale. Turkish is unique among languages. There may have been other differences from normal Unicode matching, but I did not notice them.
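For comparison, here is a small sketch of how the current locale-independent matching treats these four letters. It relies on the documented behaviour that a str pattern [a-z] with re.IGNORECASE also matches İ (U+0130) and ı (U+0131), and that re.ASCII restricts it to the ASCII letters. The helper matches_a_to_z is a hypothetical name, and nothing here reproduces the Turkish-locale behaviour of the 2014 patch.

```python
import re

# The four letters whose case mapping differs in Turkish.
LETTERS = ["I", "\u0130", "\u0131", "i"]  # I, İ, ı, i


def matches_a_to_z(ch, flags=0):
    """Check whether `ch` matches [a-z] case-insensitively."""
    return re.fullmatch("[a-z]", ch, re.IGNORECASE | flags) is not None


# Default Unicode matching for str patterns.
unicode_result = {ch: matches_a_to_z(ch) for ch in LETTERS}

# ASCII-only matching for comparison.
ascii_result = {ch: matches_a_to_z(ch, re.ASCII) for ch in LETTERS}
```

With default Unicode matching all four letters match, while with re.ASCII only I and i do. Neither mode consults the locale, so the Turkish pairing of I with ı and İ with i is not modelled.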
I still see re.LOCALE used in code in the wild. In all cases it was unnecessary.
What should we do with re.LOCALE?
Deprecate and remove it.
Implement Unicode string support.
Leave it as it was.
3 Likes
malemburg
(Marc-André Lemburg)
August 30, 2023, 1:11pm
2
re.LOCALE made sense when we were still using the bytes version of str(). In the Unicode world, locale support is either no longer necessary or you need to get into the really complex ICU world (but that’s outside the scope of what we can support in the stdlib)…
2 Likes
hugovk
(Hugo van Kemenade)
August 30, 2023, 2:40pm
3
Not much use in the top 5k projects:
$ python3 ~/github/misc/cpython/search_pypi_top.py -q . "re\.LOCALE"
./drf-extensions-0.7.1.tar.gz: drf-extensions-0.7.1/docs/backdoc.py: "l": re.LOCALE,
./sphinx_toolbox-3.4.0.tar.gz: sphinx_toolbox-3.4.0/sphinx_toolbox/more_autodoc/regex.py: if flags & re.LOCALE:
./schema-0.7.5.tar.gz: schema-0.7.5/schema.py: "re.LOCALE",
./behave-1.2.6.tar.gz: behave-1.2.6/behave/configuration.py: # -- NOTE: re.LOCALE is removed in Python 3.6 (deprecated in Python 3.5)
./behave-1.2.6.tar.gz: behave-1.2.6/behave/configuration.py: # flags = (re.UNICODE | re.LOCALE)
./djlint-1.31.1.tar.gz: djlint-1.31.1/src/djlint/lint.py: "re.LOCALE": re.LOCALE,
./pockets-0.9.1.tar.gz: pockets-0.9.1/CHANGES: * Fixes Python 3.6 compatibility by only using re.LOCALE flag on Python 2
./pockets-0.9.1.tar.gz: pockets-0.9.1/pockets/string.py: to (re.LOCALE | re.MULTILINE | re.UNICODE).
./markdown2-2.4.9.tar.gz: markdown2-2.4.9/lib/markdown2.py: "l": re.LOCALE,
./pymongo-4.4.0.tar.gz: pymongo-4.4.0/bson/__init__.py: if flags & re.LOCALE:
./pymongo-4.4.0.tar.gz: pymongo-4.4.0/bson/json_util.py: if obj.flags & re.LOCALE:
./pymongo-4.4.0.tar.gz: pymongo-4.4.0/bson/regex.py: flags |= re.LOCALE
./pyre2-0.3.6.tar.gz: pyre2-0.3.6/src/compile.pxi: return fallback(original_pattern, flags, "re.LOCALE not supported")
./pyre2-0.3.6.tar.gz: pyre2-0.3.6/src/re2.pyx: LOCALE = re.LOCALE
Time: 0:00:18.548066
Found 14 matching lines in 9 projects
1 Like
AA-Turner
(Adam Turner)
August 30, 2023, 4:27pm
4
You’d also need to look for re.L, (?L), and (?L:), but the inline flags can contain multiple modifiers (I’m not going to attempt to make a pattern to find them!)
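As a rough, unverified sketch of what such a search pattern might look like: the expression below looks for an inline flag group whose flag letters include L, allowing other letters from aiLmsux around it and an optional "-imsx" removal part. The name INLINE_L and the helper has_inline_locale_flag are hypothetical, and on arbitrary source text this will produce both false positives and misses (for example, patterns built by string concatenation).

```python
import re

# "(?" + optional other flag letters + "L" + more flag letters,
# optionally "-flags", then ")" (global form) or ":" (scoped form).
INLINE_L = re.compile(r"\(\?[aimsux]*L[aiLmsux]*(?:-[imsx]+)?[:)]")


def has_inline_locale_flag(text):
    """Return True if `text` appears to contain an inline L flag."""
    return INLINE_L.search(text) is not None


samples = ["(?L)abc", "(?iL)abc", "(?Li:abc)", "(?i)abc", "(?P<name>L)", "(?:L)"]
found = [s for s in samples if has_inline_locale_flag(s)]
```

In this sample list only the first three strings are reported; named groups and non-capturing groups that merely contain the letter L are skipped.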
hugovk
(Hugo van Kemenade)
August 30, 2023, 4:31pm
5
$ python3 ~/github/misc/cpython/search_pypi_top.py -q . "\bre\.L\b"
./apprise-1.4.0.tar.gz: apprise-1.4.0/apprise/utils.py: 'L': re.L,
./djlint-1.31.1.tar.gz: djlint-1.31.1/src/djlint/lint.py: "re.L": re.L,
./parsimonious-0.10.0.tar.gz: parsimonious-0.10.0/parsimonious/expressions.py: (locale and re.L) |
./pockets-0.9.1.tar.gz: pockets-0.9.1/pockets/string.py: RE_FLAGS = re.L | re.M | re.U
./pymongo-4.4.0.tar.gz: pymongo-4.4.0/bson/json_util.py: "l": re.L,
./pymongo-4.4.0.tar.gz: pymongo-4.4.0/test/test_bson.py: regex = re.compile(b"", re.I | re.L | re.M | re.S | re.X)
./pymongo-4.4.0.tar.gz: pymongo-4.4.0/test/test_bson.py: self.assertEqual(re.I | re.L | re.M | re.S | re.X, Regex.from_native(regex).flags)
./pyre2-0.3.6.tar.gz: pyre2-0.3.6/src/re2.pyx: I, M, S, U, X, L = re.I, re.M, re.S, re.U, re.X, re.L
Time: 0:00:41.166755
Found 8 matching lines in 6 projects
I expect the others are a similar order of magnitude.
1 Like
storchaka
(Serhiy Storchaka)
September 18, 2023, 2:15pm
6
The majority of votes are in favor of deprecation.
2 Likes