By Leonard Dye via Discussions on Python.org at 19Aug2022 21:03:
I’m almost happy with this script.
I still find this approach rather strange. You’re getting a textual
printout of the names in a module, and scanning that single text string
for your target names.
If your objective is to use regexps to scan text, rather than purely
to classify the names, this may be sensible. But if you’re just trying
to identify your nondunder names, I think that converting dir() to a
single string and scanning it is a complex and error-prone way to do
this.
All that said, let’s look at your code for the concern you’ve expressed:
One problem is the lack of one ’ at the beginning of the list. I don’t
understand this. The first as well as all commands should be in a
single quote.
The result of dir(math) is a list of str instances:
>>> import math
>>> dir(math)
['__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'acos', 'acosh', 'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'comb', 'copysign', 'cos', 'cosh', 'degrees', 'dist', 'e', 'erf', 'erfc', 'exp', 'expm1', 'fabs', 'factorial', 'floor', 'fmod', 'frexp', 'fsum', 'gamma', 'gcd', 'hypot', 'inf', 'isclose', 'isfinite', 'isinf', 'isnan', 'isqrt', 'lcm', 'ldexp', 'lgamma', 'log', 'log10', 'log1p', 'log2', 'modf', 'nan', 'nextafter', 'perm', 'pi', 'pow', 'prod', 'radians', 'remainder', 'sin', 'sinh', 'sqrt', 'tan', 'tanh', 'tau', 'trunc', 'ulp']
and str(dir(math)), which is what you are scanning with a regexp, is
this:
>>> str(dir(math))
"['__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'acos', 'acosh', 'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'comb', 'copysign', 'cos', 'cosh', 'degrees', 'dist', 'e', 'erf', 'erfc', 'exp', 'expm1', 'fabs', 'factorial', 'floor', 'fmod', 'frexp', 'fsum', 'gamma', 'gcd', 'hypot', 'inf', 'isclose', 'isfinite', 'isinf', 'isnan', 'isqrt', 'lcm', 'ldexp', 'lgamma', 'log', 'log10', 'log1p', 'log2', 'modf', 'nan', 'nextafter', 'perm', 'pi', 'pow', 'prod', 'radians', 'remainder', 'sin', 'sinh', 'sqrt', 'tan', 'tanh', 'tau', 'trunc', 'ulp']"
Here’s your regular expression:
(\b[a-z][^_]+\b[^_])
We can ignore the surrounding (), as you are not currently using the
subgroups, so:
\b[a-z][^_]+\b[^_]
being:
- a word boundary
- an alphabetic character
- 1 or more non-underscore characters
- a word boundary
- a non-underscore
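For reading purposes, the same breakdown can be written as an annotated
pattern using the re.VERBOSE flag. This is only a sketch: the names
PLAIN, ANNOTATED and sample are illustrative and not from the original
script, and the sample is a made-up miniature of str(dir(math)).

```python
import re

# The pattern as written, minus the outer parentheses.
PLAIN = re.compile(r"\b[a-z][^_]+\b[^_]")

# The same pattern with one component per line. Under re.VERBOSE,
# whitespace and "#" comments inside the pattern are ignored.
ANNOTATED = re.compile(
    r"""
    \b       # a word boundary
    [a-z]    # a lowercase letter
    [^_]+    # 1 or more non-underscore characters
    \b       # a word boundary
    [^_]     # a non-underscore
    """,
    re.VERBOSE,
)

sample = "['__doc__', 'acos']"  # made-up miniature of str(dir(math))

# Both spellings describe the same pattern, so they find the same text.
print(PLAIN.findall(sample))
print(ANNOTATED.findall(sample))
```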
This has some problems, exhibited in your output, but let’s look at your
“missing leading quote mark” issue first.
The first character matched in your expression is an alphabetic
character. So you’re not matching a quote mark, and it will not be
included in your match.
You do match a non-underscore at the end of the expression, and as it
happens that non-underscore is a quote mark in the text, so you get a
quote mark at the end of the match.
All this is because re.findall, which is the correct thing to do for
the approach you are taking, does not have to match at the start of the
text. You could include the character preceding the word boundary in
your match, like this (using your “non-underscore” criterion):
[^_]\b[a-z][^_]+\b[^_]
which would pick up the leading quote mark in the matched text.
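A small comparison of the two patterns, as a sketch. The sample string
is a made-up miniature of str(dir(math)) with a single nondunder name,
and the variable names are illustrative only.

```python
import re

sample = "['__doc__', 'acos']"  # made-up miniature of str(dir(math))

# Original pattern: the match begins at the first letter.
original = re.findall(r"\b[a-z][^_]+\b[^_]", sample)

# With a leading [^_]: the character before the word boundary
# (here the opening quote mark) becomes part of the match.
with_lead = re.findall(r"[^_]\b[a-z][^_]+\b[^_]", sample)

print(original)
print(with_lead)
```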
A larger concern is that you have an overly generous regular expression.
Your intent, I gather, is to obtain a list of the nondunder names from
dir(math). You print these names out in a loop at the end with
distinct print() calls. That should print one name per line:
for item in commands_list:
    print(item)
However, look at the output:
acos', 'acosh', 'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'copysign', 'cos', 'cosh', 'degrees', 'e', 'erf', 'erfc', 'exp', 'expm1', 'fabs', 'factorial', 'floor', 'fmod', 'frexp', 'fsum', 'gamma', 'gcd', 'hypot', 'inf', 'isclose', 'isfinite', 'isinf', 'isnan', 'ldexp', 'lgamma', 'log', 'log10', 'log1p', 'log2', 'modf', 'nan', 'pi', 'pow', 'radians', 'remainder', 'sin', 'sinh', 'sqrt', 'tan', 'tanh', 'tau', 'trunc'
That is one line of text, indicating that there is just one very long
string in commands_list, not many short strings. Try printing
len(commands_list) to check this.
Why is this so? Let’s review your regular expression:
\b[a-z][^_]+\b[^_]
being:
- a word boundary
- an alphabetic character
- 1 or more non-underscore characters
- a word boundary
- a non-underscore
The core concern here is that a “non-underscore” (from [^_]) matches
anything that is not an underscore, including punctuation. So it
happily consumes these characters after the names:
', '
The \b word boundary is just that: a boundary marker. It occurs
between a “word” and “nonword” character in either order. It has no
other direct effect. Specifically, it does not force matched stuff
between the marks to “be word characters”.
Your regular expression happily matches the entire string from the
first word to the last, as a single match.
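The single-match behaviour can be reproduced on a small scale. This is
a sketch using a made-up short list in place of dir(math); only the
pattern and the name commands_list come from the script under
discussion.

```python
import re

# Made-up miniature of dir(math); the real list is much longer.
names = ['__doc__', '__name__', 'acos', 'acosh', 'atan2', 'e', 'pi']
text = str(names)

commands_list = re.findall(r"(\b[a-z][^_]+\b[^_])", text)

# One long string rather than one string per name: [^_]+ runs
# straight through the quotes, commas and spaces between names.
print(len(commands_list))
for item in commands_list:
    print(item)
```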
What you primarily need to do is to ensure that the middle of the word
is only what you want. I would imagine that to be, maybe, just letters.
So instead of matching non-underscores with [^_] you should perhaps
match letters with [a-z], resulting in this:
\b[a-z][a-z]+\b[^_]
The expression would match a letters-only word followed by a
non-underscore. Because there is a word boundary before that
non-underscore, it cannot match e.g. abcde_ since that is considered a
“word” for purposes of \b and thus there would not be a “word
boundary” between the e and the _. So your final “non-underscore”
can effectively only match punctuation. Which is what you wanted.
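Trying the letters-only pattern on a made-up short list (standing in
for dir(math)) gives one match per name, as a sketch of the behaviour
described above. Two things are worth noticing on this sample: each
match still carries the trailing quote mark consumed by the final
[^_], and names such as atan2 and e are not matched at all, because
[a-z][a-z]+ needs two or more letters and allows no digits.

```python
import re

# Made-up miniature of dir(math); the real list is much longer.
names = ['__doc__', '__name__', 'acos', 'acosh', 'atan2', 'e', 'pi']
text = str(names)

matches = re.findall(r"\b[a-z][a-z]+\b[^_]", text)

# Several short matches now, instead of one long one.
print(len(matches))
for item in matches:
    print(item)
```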
But this kind of complication is why regular expressions are considered
overused. They are hard to get correct, particularly for people new to
them.
My personal approach would not be to convert dir(math) into a single
string to scan for nondunder names. I would keep it as is (a list) and
scan that list for nondunder names. You could classify the names in the
list using a regular expression somewhat as you are now, or use a
non-regular-expression based classification with the string startswith
and endswith methods.
Untested sketch:
commands_list = []
for name in dir(math):
    if ... test that name is a nondunder name ...:
        commands_list.append(name)
inserting your preferred test expression.
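One possible way to fill in that test, as a sketch only, uses the
startswith and endswith string methods. The helper name is_dunder and
its exact rule (two leading and two trailing underscores with
something in between) are assumptions for illustration; adjust the
test to whatever you count as a dunder name.

```python
import math


def is_dunder(name):
    """Hypothetical helper: true for names shaped like __this__."""
    return len(name) > 4 and name.startswith("__") and name.endswith("__")


commands_list = []
for name in dir(math):
    if not is_dunder(name):
        commands_list.append(name)

# Each element is now a separate name, so this prints one per line.
for item in commands_list:
    print(item)
```

Because the list is never turned into text, there are no quote marks,
commas or brackets to work around.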
Cheers,
Cameron Simpson cs@cskk.id.au