I am looking for something similar to glob.has_magic for the re module, so that I can default to vanilla str.find if the pattern is a pure vanilla string.
Does something like that exist in the standard library or an external one?
Note that glob.has_magic is not a part of the public API.
You can just test re.escape(s) == s. This will give you many false negatives, so you may want to write your own function that checks whether the string contains any characters that have special meaning in regular expressions: “[”, “(”, “+”, “^”, “\”, etc. This will still give you some false negatives (for example “{” may have special meaning (as in “a{1}”) or may be a literal “{” (as in “a{b}”); you need a simplified regexp parser to get a more accurate answer).
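For illustration, a minimal sketch of both checks (the hand-rolled character set is a guess and deliberately over-flags, so 'a{b}' is reported as non-plain even though its braces are literal):

import re

def is_plain_by_escape(s):
    # If escaping changes nothing, the string contains no characters that
    # re.escape considers special. Conservative: in recent Python versions
    # even a space is escaped, so 'a b' is a false negative.
    return re.escape(s) == s

_SPECIAL = set('.^$*+?{}[]()|\\')

def is_plain_by_scan(s):
    # Conservative scan: flags any character that *can* be special, even
    # where it would actually be literal (e.g. '{' in 'a{b}').
    return not any(c in _SPECIAL for c in s)

print(is_plain_by_escape('abc'))  # True
print(is_plain_by_scan('a{b}'))   # False (over-flagged)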
There are no use cases for such a complex function in the stdlib, so it is unlikely to be added.
Yeah, I am just seeing if anyone has done this already before I write my own.
If it does not exist, I don’t think it should be added to the standard library.
What I think could alternatively be a good idea is for re to default to a simple string search if the input contains none of the magic characters. I think a big portion of queries in certain applications (i.e. replacements for standard grep tools and similar ad-hoc utilities) are just constant strings. At least in my experience.
Although there is no specific function for such a check, maybe this case can be easily inferred as part of the compilation process?
Can a regex be written to parse each regex with? It might not be the most maintainable code in practice, but I wouldn’t be able to resist it.
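For instance, a sketch in that spirit (the character class is a guess modeled on glob’s internal magic_check and deliberately over-flags, e.g. 'a{b}' counts as magic even though its braces are literal):

import re

# Analogue of glob's internal magic_check, for re patterns (hypothetical;
# the class errs on the side of flagging too much).
re_magic_check = re.compile(r'[.^$*+?{}\[\]()|\\]')

def re_has_magic(pattern):
    return re_magic_check.search(pattern) is not None

print(re_has_magic('abc'))   # False
print(re_has_magic('a{b}'))  # True (over-flagged: this '{' is literal)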
Also, note that glob.has_magic does identify constant escapes as magic. I had to do:
import glob
import re

def has_glob_magic(patt):
    # Split out escaped wildcards ("[*]", "[?]", "[[]") so they are not
    # counted as magic; only check the remaining plain segments.
    parts = re.compile(r'\[([*?\[])\]').split(patt)
    for p in parts[::2]:
        if glob.has_magic(p):
            return True
    return False
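With that helper, escaped wildcards no longer count as magic:

print(has_glob_magic('a[*]b'))  # False: the '*' is escaped, so it is literal
print(has_glob_magic('a*b'))    # True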
Looking through the regex logic, there is already an optimisation for this situation. If the regex starts with literal characters, they’re collected into an “INFO” block at the start along with a skip table. Then when searching, that is used to quickly find exact matches, then continue with the rest of the regex, or just immediately succeed if it’s all literals.
Doing some quick tests, it seems a pure literal pattern is only slightly slower than str.find(). So you probably don’t need to worry.
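A quick way to check this yourself (a rough timeit sketch; absolute numbers will vary by machine and Python version):

import re
import timeit

haystack = 'x' * 10_000 + 'needle' + 'x' * 10_000
pat = re.compile('needle')

# Compare a plain substring search against a pure-literal regex search.
print(timeit.timeit(lambda: haystack.find('needle'), number=10_000))
print(timeit.timeit(lambda: pat.search(haystack), number=10_000))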
regex.escape accepts a few optional arguments which may reduce the false negatives.
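For example (assuming the third-party regex module is installed; special_only is a keyword argument of regex.escape):

import regex  # third-party: pip install regex

def looks_plain(s):
    # With special_only=True, only characters that are actually special in
    # a regex are escaped, so more plain strings survive escaping unchanged
    # and the equality test produces fewer false negatives.
    return regex.escape(s, special_only=True) == s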
Much thanks for this. I was convinced that this wasn’t the case. Saved me a lot of unnecessary effort.
This is not correct. File name “a[*]b” is not the same as file name “a*b”.
I think the function name is simply misleading (and the regex is incomplete). It should be has_wildcards. Of course, if a path that was valid input for glob is identified as not having wildcards, then replacing [*] with * is needed to convert it back to the raw form.
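That back-conversion could look something like this (a sketch mirroring the has_glob_magic helper above):

import re

def glob_unescape(patt):
    # Replace escaped wildcards like "[*]" with the raw character.
    return re.sub(r'\[([*?\[])\]', r'\1', patt)

print(glob_unescape('a[*]b'))  # a*b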
If you don’t mind using a non-public API you can use re._parser.parse to validate that a given pattern consists of only literal tokens:
from re._parser import parse, LITERAL
from operator import itemgetter
print(parse('a{b}')) # [(LITERAL, 97), (LITERAL, 123), (LITERAL, 98), (LITERAL, 125)]
print(set(map(itemgetter(0), parse('a{b}'))) == {LITERAL}) # True
But then it would treat escaped literals (such as \a) and a single-literal character set (such as [a]) as one literal token, so to be more robust you can check whether the values of the literal tokens match the sequence of characters of the pattern:
from re._parser import parse, LITERAL

def has_magic(pattern):
    chars = iter(pattern)
    for token, value in parse(pattern):
        if token != LITERAL or next(chars) != chr(value):
            return True
    return False

print(has_magic('a{b}'))  # False
print(has_magic('a{1}'))  # True
print(has_magic(r'\a'))   # True
print(has_magic('[a]'))   # True
or with any:
def has_magic(pattern):
    return any(
        token != LITERAL or chr(value) != char
        for (token, value), char in zip(parse(pattern), pattern)
    )
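As a quick check, the any-based version gives the same answers on the examples above, e.g.:

print(has_magic('abc'))   # False
print(has_magic('a{1}'))  # True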
Much thanks for this!
In essence, my problem is to find whether a pattern can be expressed as a literal search and, if it can, convert it to one.
Following your reply, I have:
import operator as opr
import re._parser as re_parser

def re_maybe_literal(pattern):
    # Return the pattern as a plain string if it is purely literal,
    # otherwise None.
    tokens = re_parser.parse(pattern)
    token_types = map(opr.itemgetter(0), tokens)
    all_literal = set(token_types) == {re_parser.LITERAL}
    if all_literal:
        literal = ''.join(map(chr, map(opr.itemgetter(1), tokens)))
        if isinstance(pattern, bytes):
            literal = literal.encode()
        return literal
    return None
There are false positives:
re._parser.parse('a{1,1}')
# [(MAX_REPEAT, (1, 1, [(LITERAL, 97)]))]
This one is trivial, as it could be argued that the user shouldn’t write dumb expressions, but at the same time it could (maybe even should, given the below) be converted to a literal at parse time.
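That parse-time conversion can be sketched by flattening trivial repeats before the all-literal check (flatten_trivial_repeats is a hypothetical helper, not part of re._parser):

import re._parser as re_parser

def flatten_trivial_repeats(tokens):
    # Expand MAX_REPEAT nodes with min == max into repeated copies of
    # their (recursively flattened) body.
    out = []
    for token, value in tokens:
        if token == re_parser.MAX_REPEAT:
            lo, hi, body = value
            if lo == hi:
                out.extend(flatten_trivial_repeats(body) * lo)
                continue
        out.append((token, value))
    return out

print(flatten_trivial_repeats(re_parser.parse('a{1,1}')))
# [(LITERAL, 97)]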
But generally it works well. E.g.:
re._parser.parse('[a]')
# [(LITERAL, 97)]
which does exactly what I need.