I’m not sure if this will help, but I’ve been reading up on
regex and I’ve coded this as a part of my notes:
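import re
dateString = 'the date in this string is May 20 2022'
year = re.search('[0-9][0-9][0-9][0-9]', dateString)
if year:  # re.search() returns None when nothing matches
    yearFound = (year.span())
    print(dateString[yearFound[0]:yearFound[1]])
else:
    print('Year not found')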
As an explainer, for anyone following this who does not know:
re.search() returns a match object if a match is found, otherwise it returns None.
In the above example, the match object is returned thus:
<_sre.SRE_Match object; span=(34, 38), match='2022'>
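In case it helps, here's a rough sketch of both outcomes of re.search(): a match object when the pattern is found, and None when it isn't. The second string (noYear) and the variable names found and missing are made up purely for illustration.

```python
import re

dateString = 'the date in this string is May 20 2022'
noYear = 'there is no year in this string'  # made-up string for illustration

pattern = '[0-9][0-9][0-9][0-9]'

found = re.search(pattern, dateString)
missing = re.search(pattern, noYear)

print(found)    # a match object; the exact repr depends on the Python version
print(missing)  # None

# None is what you test for before calling any method on the result.
if missing is None:
    print('Year not found')
```

Calling .span() on None raises an AttributeError, which is why the check comes first.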
span=(34, 38) holds the start & end positions of ‘2022’ in slice notation, which is the same as
dateString[34:38]; that is to say that the match starts at character position 34 and extends up to (but not including) position 38.
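To see the 'up to but not including' part for yourself, something like this should do. It only uses the string and pattern from my notes, and the individual index lookups are just there to show where the slice begins and ends.

```python
import re

dateString = 'the date in this string is May 20 2022'
year = re.search('[0-9][0-9][0-9][0-9]', dateString)

print(year.span())        # (34, 38)
print(dateString[34:38])  # 2022
print(dateString[34])     # 2, the first character of the match
print(dateString[37])     # 2, the last character of the match
print(len(dateString))    # 38, so position 38 is just past the end of the string
```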
The real power of
regex shows when you need to match a pattern rather than a fixed string. In this example we need to match four consecutive digits. We can do this with character classes, which are built from the metacharacters [ and ].
We can match any single character or a range of characters:
[3] would match ‘3’
[R] would match ‘R’ and so on. To match a range, we can use the metacharacter
-, so any digit between zero and nine would be
[0-9]. We need to find a four digit year, so
'[0-9][0-9][0-9][0-9]' means any string (notice that it’s enclosed in single quotes) of four digits between zero and nine, back-to-back.
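Here's a quick sketch of those character classes in action. The short test strings ('abc 123', 'Regex', 'May 20 2022', 'May 20') and the name fourDigits are my own, picked only to show each class matching.

```python
import re

print(re.search('[3]', 'abc 123').group())        # 3
print(re.search('[R]', 'Regex').group())          # R
print(re.search('[0-9]', 'May 20 2022').group())  # 2, the first digit found

fourDigits = '[0-9][0-9][0-9][0-9]'
print(re.search(fourDigits, 'May 20 2022').group())  # 2022
print(re.search(fourDigits, 'May 20'))               # None, '20' is only two digits
```

As a side note, the re module also accepts a repeat count, so '[0-9]{4}' is a shorter way of writing the same four digit pattern.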
We can then use the
.span() method to extract the
(34, 38) tuple and assign it to a variable:
yearFound = (year.span()) from which it can be unpacked with
[yearFound[0]:yearFound[1]] and displayed with
print(dateString[yearFound[0]:yearFound[1]]) which we can then include in the code.
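For anyone who wants to try it, here's a small sketch of getting the two numbers out of the span tuple, once by index and once by tuple unpacking. The names byIndex, byUnpacking, start and end are mine, and .group() is shown only as a cross-check that all three give the same text.

```python
import re

dateString = 'the date in this string is May 20 2022'
year = re.search('[0-9][0-9][0-9][0-9]', dateString)

if year:
    yearFound = year.span()                           # (34, 38)
    byIndex = dateString[yearFound[0]:yearFound[1]]   # index into the tuple
    start, end = yearFound                            # tuple unpacking, same two numbers
    byUnpacking = dateString[start:end]
    print(byIndex, byUnpacking, year.group())         # 2022 2022 2022
else:
    print('Year not found')
```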
The ‘gotcha’ here is that you’d need to know the format of the data to be searched.
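To show what I mean by that, here's a sketch with two made-up strings (orderFirst and shortYear are invented for illustration, they're not from any real data). The same pattern either grabs the wrong digits or finds nothing at all.

```python
import re

pattern = '[0-9][0-9][0-9][0-9]'

# Made-up strings, purely for illustration.
orderFirst = 'order 12345 shipped on May 20 2022'
shortYear = 'the date in this string is 20/05/22'

print(re.search(pattern, orderFirst).group())  # 1234, not the year
print(re.search(pattern, shortYear))           # None, there is no run of four digits
```

re.search() stops at the first place the pattern fits, so any earlier run of four digits wins over the year.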