Hi Mark,
You said:
“It is documentation written by codesmiths for codesmiths.”
Well, yes.
Have you considered doing the tutorial? That is documentation written
for beginners and may help you a lot.
https://docs.python.org/3/tutorial/index.html
Are you familiar with regular expressions in other languages? Python
regexes are pretty close to identical to the standard regex syntax used
by most languages. You just need to adjust to a few minor differences.
One minor difference compared to, say, Perl, is that regexes do not have
their own special type. Regexes in Python are just strings. So if you
know how to write a string:
food = "cheese"
then you know how to write a regular expression pattern:
pattern = "cheese" # Matches the exact string 'cheese'.
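To see that an ordinary string really does work as a pattern, here is a minimal sketch. The sample text is made up for illustration; `re.search` is the standard library function that looks for the pattern anywhere in the text.

```python
import re

pattern = "cheese"  # an ordinary string, used as a regex pattern
text = "I would like some cheese, please."  # made-up sample text

match = re.search(pattern, text)  # a match object, or None if not found
if match:
    print("found:", match.group())
else:
    print("no match")
```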
The only tricky part is that Python strings, by default, interpret
backslash as an escape sequence for control characters. So:
newline = '\n'
is a one character string, not two, containing an ASCII newline, not
backslash followed by n. That can clash with the use of actual
backslashes in regular expressions:
match_digits = '\d+' # Match one or more digits. (Maybe...)
Without actually trying it, I don’t know whether that will be a
backslash followed by d and +, or some control character followed by +.
The rules for when backslashes are interpreted as escape sequences for
control characters (like newline, tab, carriage return etc) and when
they are left in are unfortunately confusing.
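You can check how many characters an escape sequence produces with `len`. A quick sketch, sticking to escapes that Python definitely recognises (the variable name `two_chars` is just mine):

```python
newline = '\n'        # recognised escape: a single newline character
print(len(newline))   # 1
print(ord(newline))   # 10, the ASCII code for newline

two_chars = '\\n'     # escaped backslash, then the letter n
print(len(two_chars)) # 2
print(two_chars)      # shows a backslash followed by n
```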
There are two easy ways to fix that:
# Double-up the backslashes so that they are escaped.
match_digits = '\\d+' # Match one or more digits.
# Use a so-called "raw string".
match_digits = r'\d+' # Match one or more digits.
Both of those will result in match_digits being the string:
\d+
which can be used as a regular expression pattern to match one or more
digits.
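If you want to convince yourself that the two spellings are the same string, a small sketch like this should do it. The sample text passed to `re.findall` is invented for the example.

```python
import re

escaped = '\\d+'   # doubled backslash
raw = r'\d+'       # raw string

print(escaped == raw)  # True: both are backslash, d, plus
print(len(raw))        # 3

# Either one works as a pattern matching runs of digits.
print(re.findall(raw, "room 101, floor 7"))  # ['101', '7']
```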
Raw strings disable the special interpretation of backslash escape
sequences, so that backslashes become just a plain old character like
the rest. So while it is not compulsory, nearly everyone using Python
prefers to use raw strings for regex patterns. It just makes it easier.
So in this example from the docs, we have this function call:
re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
r'static PyObject*\npy_\1(void)\n{',
'def myfunc():')
Let’s break it down. The first argument is the pattern to match, the
second is the replacement string, and the third is the text to match
against:
target = r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'
replacement = r'static PyObject*\npy_\1(void)\n{'
text = 'def myfunc():'
re.sub(target, replacement, text)
which returns:
'static PyObject*\npy_myfunc(void)\n{'
(Personally, I think this is a horribly complicated example for
beginners to be exposed to in the docs. Oh well.)
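If you would like to run that yourself, here is the same call as a complete script. The only things I have added are the import and the two print calls, so you can compare the repr of the result with how it looks when printed.

```python
import re

target = r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'
replacement = r'static PyObject*\npy_\1(void)\n{'
text = 'def myfunc():'

result = re.sub(target, replacement, text)
print(repr(result))  # the string as Python would display it
print(result)        # the same string, with the newlines shown as line breaks
```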
Let’s analyse the target. The leading r’ tells the interpreter that this
is a raw string, and to treat backslash as an ordinary character, not a
control character escape sequence. So we have the pattern:
def # Literally the letters d e f
\s+ # one or more whitespace characters (spaces, tabs, newlines)
( # begins a group
[a-zA-Z_] # a letter a...z, A...Z or underscore
[a-zA-Z_0-9]* # zero or more letters, underscores or digits
) # ends the group
\s* # zero or more whitespace characters
\( # a literal left-bracket (
\s* # zero or more whitespace characters
\) # a literal right-bracket )
: # a colon
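As it happens, Python lets you write a pattern in this commented, one-piece-per-line style directly, using the `re.VERBOSE` flag, which ignores whitespace in the pattern and treats `#` as the start of a comment. A sketch of the same pattern written that way (the name `def_pattern` and the test strings are my own):

```python
import re

def_pattern = re.compile(r"""
    def              # literally the letters d e f
    \s+              # one or more whitespace characters
    (                # begins group 1
      [a-zA-Z_]      # a letter a...z, A...Z or underscore
      [a-zA-Z_0-9]*  # zero or more letters, underscores or digits
    )                # ends the group
    \s*              # zero or more whitespace characters
    \(               # a literal left-bracket
    \s*              # zero or more whitespace characters
    \)               # a literal right-bracket
    :                # a colon
""", re.VERBOSE)

print(bool(def_pattern.search('def myfunc():')))     # True
print(bool(def_pattern.search('def  spam_1 ( ):')))  # True, extra whitespace is allowed
print(bool(def_pattern.search('def myfunc:')))       # False, no brackets
```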
So that’s the regex pattern being searched for. The replacement is:
static PyObject* # literally what you see
\n # newline
py_ # literally the letters p y underscore
\1 # the contents of Group 1 in the matched text
(void) # literally what you see
\n # another newline
{ # and a left-brace {
The text being searched is “def myfunc():”. It matches the regex, with
the word “myfunc” matching group 1. So the regular expression engine
extracts the group “myfunc” and inserts it where the backslash-1 is,
giving us the resulting text:
# This original
def myfunc():
# becomes this
static PyObject*
py_myfunc(void)
{
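You can also watch those two steps happen separately. In this sketch, `re.search` does the matching, `group(1)` pulls out the captured name, and the match object's `expand` method fills in the replacement template the same way `re.sub` does. The variable names `name` and `expanded` are mine.

```python
import re

target = r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'
replacement = r'static PyObject*\npy_\1(void)\n{'

match = re.search(target, 'def myfunc():')

name = match.group(1)                 # the text captured by group 1
expanded = match.expand(replacement)  # the template with \1 and \n filled in

print(name)
print(expanded)
```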
Regular expressions are a special language of their own. They are
complex and powerful, and (aside from some minor differences) shared by
most programming languages, but also painfully terse. If you have never
used them before, I can understand the culture shock you are
experiencing.