Ability to query minimum and maxium width of regular expression

MegaIng · December 19, 2023, 5:44am

I made a Feature Request a few weeks ago on github, but didn’t recieve any positive or negative comments. So before submitting a PR, I am going to post here.

github.com/python/cpython

Ability to query minimum and maxium length of regular expression

opened 11:40PM - 24 Nov 23 UTC

MegaIng

type-feature stdlib topic-regex

# Feature or enhancement ### Proposal: For the [lark](https://github.com/l…ark-parser/lark) parsing library we currently use the private `re._parser` module, as noticed when reorganizing the relevant libraries in #91308. The only information we need is the minimum and maximum width of a match a pattern can have. My suggestion is to add relevant attributes/properties to the `Pattern` class, for example as with the names `min_width` and `max_width`. `max_width` could be either `None` or `MAXREPEAT` (the constant from `re._constants`/`_sre`) when the pattern could match an (essentially) unlimited amount of text. ```python pattern = re.compile(r"abc?d?e") assert (pattern.min_width, pattern.max_width) == (3, 5) pattern = re.compile(r"(a*b+){2, 5}") assert (pattern.min_width, pattern.max_width) == (2, None) ``` As an alternative, the `re._*` modules could be made a public and stable API, although this doesn't appear to be a well liked option from my reading of the above linked PR. I would like this, primarily for implementing custom regex analyzers (there a few such users of the `re._parser` module out there), but I think this would have to be a PEP. ### Has this already been discussed elsewhere? This is a minor feature, which does not need previous discussion elsewhere ### Links to previous discussion of this feature: I don't think this is a major enough feature to require widespread discussion. I [requested](https://github.com/mrabarnett/mrab-regex/issues/513) a similar feature in the third party `regex` library. Preferably ofcourse both would have the same interface.

Essentially, lark has a usecase for accessing the minimal and maximal width an re.Pattern instance can match. This is currently accessible with re._parser.parse(...).getwidth(). Since the move of the modules in python/cpython#91308 and a small change in behavior in 3.11.7 ^[1] we only noticed because a user reported a bug, a consistent and public interface would be nice. This could easily take the form of an attribute on the Pattern object, I see no reason to expose any other part of the internals of re._parser, at least for our usecase. If there is a different effort/plan to expose all of that for public consumption, that would of course also satisfy our need.

Specifically in terms of interface I could imagine:

An attribute/property widths (or lengths or sizes) containing a tuple (min, max)
A method getwidths (getlengths, get_lengths, …) returning this same tuple
Two attributes min_width, max_width (or *_length, …)
Two functions getminwidth and getmaxwidth

This is approximately my order of preference.

For all of these max should IMO be None to signal infinity instead of using re._parser.MAXWIDTH. Random magic numbers for a very legitimate concept don’t seem appealing to me.

In terms of implementation, it would not be a lot. I made a basic implementation in ~an hour, the hardest part being tests. The calculation code already exists, and is being called when compiling a regex. I currently have a mostly finished change set that stores an extra attribute containing the tuple. This would mean extra memory usage in all patterns. If that is a concern, this information could instead be dynamically extracted from the internal compiled code. ^[2] ^[3]

The third party regex module would have a bit of trouble with supporting this addition since their syntax is non-regular enough for the bounds to be potentially uncomputable or at the very least quite tricky/expensive to compute. How they would handle that (for example heuristics) should however not effect what the stdlib does.

There are edge cases where the existing width calculations are not correct because they ignore lookahead/lookbehinds: (?!abc)abc|defg would return the bounds (3, 4) despite there not existing any match of length 3. I would say these cases can be ignored because of how pathatlogical and easily explainable they are.

Introduction of re._parser.MAXWIDTH, replacing re._constant.MAXREPEAT which previously filled that role ↩︎
This technically has a lower resolution of only supporting up to 32bit widths instead of 64bit widths. However, those both sound unlikely as bounds for regex widths to me ↩︎
Dynamically computing the values would change my preference for the interface to method above attribute ↩︎

MegaIng · January 2, 2024, 1:20pm

Since I am getting zero feedback here or on GitHub, I am just going to open two PRs for the different implementations, both exposing an attribute.

guido · January 2, 2024, 5:06pm

What makes you think that you will get better feedback on a PR? That feels rather presumptuous. Deafening silence usually means one of several things: either nobody cares, or everybody is too busy. Creating more noise isn’t going to help your case.

MegaIng · January 2, 2024, 5:33pm

Well, at least on a PR there are actually concrete stuff to talk about. Deafening Silence can also mean “I have nothing to add, go ahead”. +0 vs -0 in terms of PEP 10. Since I still think this doesn’t require a PEP, creating a PR seems to be the logical next step if there are no objections.

Unless of course the CPython core devs prefer us using the internal, undocumented APIs. I distinctly remembering some core devs semi-frustrated saying something like “instead of just using the internals, just ask us for it to be added to the public interface” in the context of the sre module reorganization (although i can’t find that comment now). Should I just give this up since noone is interested?

guido · January 2, 2024, 6:30pm

The problem with a PR is that you’re adding code that the core team will have to maintain forever. If someone wanted to sign up for that they’d probably have given you a +1 here. (Though during the holidays a lot of people might just have been archiving or deleting their messages unread.)

MegaIng · January 2, 2024, 6:43pm

Is there a way for me to sign up to maintain this part of the code? Or is that just an impossible hurdle to overcome and I should just give up and continue to use the internal API?

And the issue on github also has been open well before the holidays. But I guess that is even less visible than discourse.

storchaka · January 2, 2024, 7:44pm

I seen the issue and this topic, but I was silent, because I did not know how to answer. I understand your case, but I also know pitfalls and limitations of current code.

I changed SubPattern.getwidth() several times (last time few months ago), and I cannot guarantee that it is the final version. If we going to expose its result in a public API, we should freeze its behavior. For now, it only exist to answer the following questions:

Does the subpattern match a single character? False negative is acceptable.
Does the subpattern match a fixed length string and what is this length? If the length is greater than some architecture depended limit, treat this as error. Not fixed length is a different error.
What is the minimal length of a string matched by the subpattern? Returning smaller value is acceptable. If it is greater than some architecture depended limit, use that limit.

The exact result can be arbitrary large (for example, (((x{1000000}){1000000}){1000000}){1000000}), processing large integers is slow and meanless, because all values larger than some limit are equal in the context of compiling the RE. But we need to distinguish unlimited value from larger that the limit value, and distinguish the lower and the upper values even if they are larger than the limit.

While it is an implementation detail, we can keep it as simple and fast as we need, because we control both ends. But if it is exposed to users, we need to design something that can be useful for wide range of users, with simpler semantic, and do not cut edges, even if it is less efficient for the main purpose.

MegaIng · January 2, 2024, 7:58pm

Well, this is at least a decent counter argument and not just complete silence.

The semantics I had in mind were designed with the idea of optimizing for realistic patterns. The edge cases where a simple approach to bound handling would require matches >1GB (i.e. more than 2**31) if I am reading the code correctly, which IMO are not situations we have to care about.

IMO the following semantics would be ok:

An upper bound above the architecture limit gets truncated to None to signal infinity.
A lower bound above the architecture limit gets truncated to this limit.

Note that both of these semantics are fully within the description of those bounds. The docs can just mention that there is an architecture defined limit at which these bounds will no longer be tight fits and this limit might change in minor versions. As long as the common case of patterns at most in the realm of a few MB are correctly supported (which I don’t think should interfere with changing the internals), everything should be fine.

Yes, this is cutting edges, however not all edges are the same IMO.

Topic		Replies	Views
Python 3.10.0a6 is now available Committers release	4	739	March 5, 2021
Request for review: support for ** wildcard in pathlib.PurePath.match() Core Development	3	885	May 30, 2023
Hint or advice how to improve docs about `reprlib.Repr.indent` feature Documentation	2	378	May 14, 2022
When exactly was this bug fixed? Python Help	2	315	July 26, 2023
Possible bug in `re` module in Python 3.11? Python Help release , help	3	1443	October 26, 2022

Ability to query minimum and maxium width of regular expression

Related Topics