`re` equivalent of `glob.has_magic`

I am looking for something similar to glob.has_magic for the re module, so that I can default to plain str.find if the pattern is just a plain string.

Does something like that exist in standard/external library?

Note that glob.has_magic is not a part of the public API.

You can just test re.escape(s) == s. This will give you many false negatives, so you may want to write your own function to check that the string does not have any characters that have special meaning in regular expressions: “[”, “(”, “+”, “^”, “\”, etc. This will still give you some false negatives (for example “{” may have special meaning (in “a{1}”) or may be a literal “{” (in “a{b}”)) – you need a simplified regexp parser to get a more accurate answer.
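The quick check can be sketched like this (a conservative test; note that since Python 3.7 re.escape also escapes spaces and braces, which is one source of the false negatives mentioned):

```python
import re

def looks_literal(s):
    # Conservative check: if escaping changes nothing, the string has
    # no characters that re.escape considers special.  Since Python 3.7
    # re.escape also escapes spaces and '{', so plain text like "a b"
    # is reported as non-literal (a false negative).
    return re.escape(s) == s

print(looks_literal("abc"))   # True
print(looks_literal("a.b"))   # False
print(looks_literal("a b"))   # False, even though it is a literal
```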

There are no use cases for such a complex function in the stdlib, so it is unlikely to be added.


Yeah, I am just seeing if anyone has done this already before I write my own.

If it does not exist, I don’t think it should be added to the standard library.

What I think could alternatively be a good idea is for re to default to simple string search if the input has none of the magic characters. I think a big portion of queries in certain applications (i.e. replacements for standard grep tools and similar ad-hoc utilities) are just constant strings. At least in my experience.

Although there is no specific function for such a check, maybe such a case can be easily inferred as part of the compilation process?
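Until then, a fallback of this sort is easy to sketch in user code (a hypothetical wrapper, using the conservative re.escape test suggested earlier in the thread):

```python
import re

def search(pattern, text):
    # Hypothetical wrapper: use plain substring search when escaping
    # the pattern changes nothing, i.e. it contains no special chars.
    if re.escape(pattern) == pattern:
        pos = text.find(pattern)
        return pos if pos >= 0 else None
    m = re.search(pattern, text)
    return m.start() if m else None

print(search("abc", "xxabcxx"))  # 2
print(search("a.c", "xxabcxx"))  # 2
print(search("zzz", "xxabcxx"))  # None
```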


Can a regex be written to parse each regex with? It might not be the most maintainable code in practice, but I wouldn’t be able to resist it.

Also, note that it does identify constant escapes as magic. I had to do:

import glob
import re

def has_glob_magic(patt):
    # Split the pattern on glob.escape-style sequences ([*], [?], [[])
    # and check only the unescaped parts for wildcards.
    parts = re.compile(r'\[([*?\[])\]').split(patt)
    for p in parts[::2]:
        if glob.has_magic(p):
            return True
    return False

Looking through the regex logic, there is already an optimisation for this situation. If the regex starts with literal characters, they’re collected into an “INFO” block at the start along with a skip table. Then when searching, that is used to quickly find exact matches, then continue with the rest of the regex, or just immediately succeed if it’s all literals.

Doing some quick tests, it seems a pure literal pattern is only slightly slower than str.find(). So you probably don’t need to worry.
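A rough way to check this yourself (timings vary by Python build and input, so treat the numbers as indicative only):

```python
import re
import timeit

text = "x" * 100_000 + "needle" + "x" * 100_000
pat = re.compile("needle")  # pure literal pattern

# Sanity check: both approaches find the needle at the same offset.
assert pat.search(text).start() == text.find("needle") == 100_000

t_re = timeit.timeit(lambda: pat.search(text), number=200)
t_find = timeit.timeit(lambda: text.find("needle"), number=200)
print(f"re.search: {t_re:.4f}s  str.find: {t_find:.4f}s")
```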


regex.escape (from the third-party regex module) accepts a few keyword arguments (such as special_only) which may reduce the false negatives.

Much thanks for this. I was convinced that this isn’t the case. Saved me a lot of unnecessary effort.

This is not correct. File name “a[*]b” is not the same as file name “a*b”.


I think the function name is simply misleading (and the regex is incomplete). It should be has_wildcards. Of course, if the path, which was valid input for glob, was identified not to have wildcards, then replacement of [*] with * is needed to convert it back to raw form.
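That conversion is a one-liner (a hypothetical inverse of glob.escape, valid only after the pattern has been confirmed wildcard-free):

```python
import glob
import re

def glob_unescape(patt):
    # Undo glob.escape: turn '[*]' back into '*', '[?]' into '?',
    # and '[[]' into '['.
    return re.sub(r'\[([*?\[])\]', r'\1', patt)

print(glob_unescape(glob.escape("a*b")))  # 'a*b'
print(glob_unescape("a[?]b"))             # 'a?b'
```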

If you don’t mind using a non-public API you can use re._parser.parse to validate that a given pattern consists of only literal tokens:

from re._parser import parse, LITERAL
from operator import itemgetter

print(parse('a{b}')) # [(LITERAL, 97), (LITERAL, 123), (LITERAL, 98), (LITERAL, 125)]
print(set(map(itemgetter(0), parse('a{b}'))) == {LITERAL}) # True

But then it would treat escaped literals (such as \a) and a single-literal character set (such as [a]) as one literal token, so to be more robust you can check that the values of the literal tokens match the sequence of characters of the pattern:

from re._parser import parse, LITERAL
from operator import itemgetter

def has_magic(pattern):
    chars = iter(pattern)
    for token, value in parse(pattern):
        if token != LITERAL or next(chars) != chr(value):
            return True
    return False

print(has_magic('a{b}')) # False
print(has_magic('a{1}')) # True
print(has_magic(r'\a')) # True
print(has_magic('[a]')) # True

or with any:

def has_magic(pattern):
    return any(
        token != LITERAL or chr(value) != char
        for (token, value), char in zip(parse(pattern), pattern)
    )

Much thanks for this!

In essence my problem is to find whether a pattern can be expressed as a literal search, and then if it can, convert it to such.

Following your reply, I have:

import operator as opr
import re._parser as re_parser

def re_maybe_literal(pattern):
    tokens = re_parser.parse(pattern)
    token_types = map(opr.itemgetter(0), tokens)
    all_literal = set(token_types) == {re_parser.LITERAL}
    if all_literal:
        literal = ''.join(map(chr, map(opr.itemgetter(1), tokens)))
        if isinstance(pattern, bytes):
            # Literal values of a bytes pattern are byte codepoints
            # (0-255), so latin-1 round-trips them exactly.
            literal = literal.encode('latin-1')
        return literal
    return None

There are false positives:

re._parser.parse('a{1,1}')
# [(MAX_REPEAT, (1, 1, [(LITERAL, 97)]))]

This one is trivial, as it could be argued that a user shouldn’t write dumb expressions, but at the same time it could (maybe even should, given the below) be converted to a literal at parse time.

But generally it works well. E.g.:

re._parser.parse('[a]')
# [(LITERAL, 97)]

which does exactly what I need.