Ability to query minimum and maxium width of regular expression

I made a Feature Request a few weeks ago on github, but didn’t recieve any positive or negative comments. So before submitting a PR, I am going to post here.

Essentially, lark has a usecase for accessing the minimal and maximal width an re.Pattern instance can match. This is currently accessible with re._parser.parse(...).getwidth(). Since the move of the modules in python/cpython#91308 and a small change in behavior in 3.11.7 [1] we only noticed because a user reported a bug, a consistent and public interface would be nice. This could easily take the form of an attribute on the Pattern object, I see no reason to expose any other part of the internals of re._parser, at least for our usecase. If there is a different effort/plan to expose all of that for public consumption, that would of course also satisfy our need.


Specifically in terms of interface I could imagine:

  • An attribute/property widths (or lengths or sizes) containing a tuple (min, max)
  • A method getwidths (getlengths, get_lengths, …) returning this same tuple
  • Two attributes min_width, max_width (or *_length, …)
  • Two functions getminwidth and getmaxwidth

This is approximately my order of preference.

For all of these max should IMO be None to signal infinity instead of using re._parser.MAXWIDTH. Random magic numbers for a very legitimate concept don’t seem appealing to me.


In terms of implementation, it would not be a lot. I made a basic implementation in ~an hour, the hardest part being tests. The calculation code already exists, and is being called when compiling a regex. I currently have a mostly finished change set that stores an extra attribute containing the tuple. This would mean extra memory usage in all patterns. If that is a concern, this information could instead be dynamically extracted from the internal compiled code. [2] [3]


The third party regex module would have a bit of trouble with supporting this addition since their syntax is non-regular enough for the bounds to be potentially uncomputable or at the very least quite tricky/expensive to compute. How they would handle that (for example heuristics) should however not effect what the stdlib does.


There are edge cases where the existing width calculations are not correct because they ignore lookahead/lookbehinds: (?!abc)abc|defg would return the bounds (3, 4) despite there not existing any match of length 3. I would say these cases can be ignored because of how pathatlogical and easily explainable they are.


  1. Introduction of re._parser.MAXWIDTH, replacing re._constant.MAXREPEAT which previously filled that role ↩︎

  2. This technically has a lower resolution of only supporting up to 32bit widths instead of 64bit widths. However, those both sound unlikely as bounds for regex widths to me ↩︎

  3. Dynamically computing the values would change my preference for the interface to method above attribute ↩︎

2 Likes

Since I am getting zero feedback here or on GitHub, I am just going to open two PRs for the different implementations, both exposing an attribute.

What makes you think that you will get better feedback on a PR? That feels rather presumptuous. Deafening silence usually means one of several things: either nobody cares, or everybody is too busy. Creating more noise isn’t going to help your case.

Well, at least on a PR there are actually concrete stuff to talk about. Deafening Silence can also mean “I have nothing to add, go ahead”. +0 vs -0 in terms of PEP 10. Since I still think this doesn’t require a PEP, creating a PR seems to be the logical next step if there are no objections.

Unless of course the CPython core devs prefer us using the internal, undocumented APIs. I distinctly remembering some core devs semi-frustrated saying something like “instead of just using the internals, just ask us for it to be added to the public interface” in the context of the sre module reorganization (although i can’t find that comment now). Should I just give this up since noone is interested?

The problem with a PR is that you’re adding code that the core team will have to maintain forever. If someone wanted to sign up for that they’d probably have given you a +1 here. (Though during the holidays a lot of people might just have been archiving or deleting their messages unread.)

1 Like

Is there a way for me to sign up to maintain this part of the code? Or is that just an impossible hurdle to overcome and I should just give up and continue to use the internal API?

And the issue on github also has been open well before the holidays. But I guess that is even less visible than discourse.

I seen the issue and this topic, but I was silent, because I did not know how to answer. I understand your case, but I also know pitfalls and limitations of current code.

I changed SubPattern.getwidth() several times (last time few months ago), and I cannot guarantee that it is the final version. If we going to expose its result in a public API, we should freeze its behavior. For now, it only exist to answer the following questions:

  1. Does the subpattern match a single character? False negative is acceptable.
  2. Does the subpattern match a fixed length string and what is this length? If the length is greater than some architecture depended limit, treat this as error. Not fixed length is a different error.
  3. What is the minimal length of a string matched by the subpattern? Returning smaller value is acceptable. If it is greater than some architecture depended limit, use that limit.

The exact result can be arbitrary large (for example, (((x{1000000}){1000000}){1000000}){1000000}), processing large integers is slow and meanless, because all values larger than some limit are equal in the context of compiling the RE. But we need to distinguish unlimited value from larger that the limit value, and distinguish the lower and the upper values even if they are larger than the limit.

While it is an implementation detail, we can keep it as simple and fast as we need, because we control both ends. But if it is exposed to users, we need to design something that can be useful for wide range of users, with simpler semantic, and do not cut edges, even if it is less efficient for the main purpose.

Well, this is at least a decent counter argument and not just complete silence.

The semantics I had in mind were designed with the idea of optimizing for realistic patterns. The edge cases where a simple approach to bound handling would require matches >1GB (i.e. more than 2**31) if I am reading the code correctly, which IMO are not situations we have to care about.

IMO the following semantics would be ok:

  • An upper bound above the architecture limit gets truncated to None to signal infinity.
  • A lower bound above the architecture limit gets truncated to this limit.

Note that both of these semantics are fully within the description of those bounds. The docs can just mention that there is an architecture defined limit at which these bounds will no longer be tight fits and this limit might change in minor versions. As long as the common case of patterns at most in the realm of a few MB are correctly supported (which I don’t think should interfere with changing the internals), everything should be fine.

Yes, this is cutting edges, however not all edges are the same IMO.