I’m not sure if this will help, but I’ve been reading up on regex
and I’ve coded this as a part of my notes:
import re
dateString = 'the date in this string is May 20 2022'
year = re.search('[0-9][0-9][0-9][0-9]', dateString)
if year:
    print('Year found')
    yearFound = (year.span())
    print(dateString[yearFound[0]:yearFound[1]])
else:
    print('Year not found')
As an explainer, for anyone following this who does not know:
import re
re.search(<regex>, <string>)
- returns a match object if a match is found; otherwise it returns None
In the above example, the match object returned is <_sre.SRE_Match object; span=(34, 38), match='2022'> (Python 3.7 and later display it as <re.Match object; span=(34, 38), match='2022'>).
span=(34, 38) holds the start and end positions of ‘2022’, in the same form as the slice dateString[34:38]; that is to say, the match starts at character position 34 and extends up to (but not including) position 38.
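To see the span for yourself, here is a minimal sketch using the same string and pattern as above. The variable names (date_string, match, start, end) are just my own choices for illustration.

```python
import re

date_string = 'the date in this string is May 20 2022'
match = re.search('[0-9][0-9][0-9][0-9]', date_string)

# re.search returns None when nothing matches, so check before using the result.
if match:
    start, end = match.span()          # unpack the (start, end) tuple
    print(start, end)                  # 34 38
    print(date_string[start:end])      # 2022
    print(match.start(), match.end())  # the same two numbers, one at a time
```

The end position is exclusive, which is why it can be dropped straight into a slice.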
The real power of regex is pattern matching. In this example we need to match four consecutive digits, and we can do this with character classes, which are written with the metacharacters [ and ].
A character class matches any single character from the set inside the brackets: [3] would match ‘3’, [R] would match ‘R’, and so on. To match a range, we can use the metacharacter - inside the brackets, so any digit between zero and nine would be [0-9]. We need to find a four-digit year, so '[0-9][0-9][0-9][0-9]' (notice that the pattern is a string, enclosed in single quotes) means any four digits between zero and nine, back-to-back: re.search('[0-9][0-9][0-9][0-9]', dateString)
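Here is a small sketch of those character classes in action. The {4} quantifier in the last search is not something I used above; as far as I understand it, it is simply a shorter way of repeating the preceding class four times. The variable names are only for illustration.

```python
import re

text = 'the date in this string is May 20 2022'

single = re.search('[R]', 'Regex')   # a one-character class matches just 'R'
digit = re.search('[0-9]', text)     # first digit in the string (the '2' of '20')
long_form = re.search('[0-9][0-9][0-9][0-9]', text)
short_form = re.search('[0-9]{4}', text)  # {4} repeats the class four times

print(single.group())      # R
print(digit.group())       # 2
print(long_form.group())   # 2022
print(short_form.group())  # 2022
```

Both the long and the short pattern find the same match at the same position.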
We can then use the .span() method to get the (34, 38) tuple and assign it to a variable: yearFound = (year.span()). Its two values can then be used as the slice bounds, [yearFound[0]:yearFound[1]], and the result displayed with print(dateString[yearFound[0]:yearFound[1]]), which we can then include in the if branch.
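As an alternative to slicing with the span, the match object also has a .group() method that returns the matched text directly. Below is a rough sketch that wraps the whole thing in a function; find_year is a made-up name for this example, not something from the re module.

```python
import re

def find_year(text):
    """Return the first run of four digits in text, or None if there is none."""
    match = re.search('[0-9][0-9][0-9][0-9]', text)
    if match is None:
        return None
    # match.group() gives the same text as text[match.start():match.end()].
    return match.group()

print(find_year('the date in this string is May 20 2022'))  # 2022
print(find_year('no year in here'))                          # None
```

Whether you slice with the span or call .group() is mostly a matter of taste; the span is still handy when you need the position as well as the text.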
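The ‘gotcha’ here is that you’d need to know the format of the data to be searched, because this pattern matches the first four consecutive digits it finds, whether or not they are a year.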
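To illustrate the gotcha, here is a sketch with a made-up string that contains a longer number before the year. The word-boundary version (\b) is just one possible way to tighten the pattern, assuming the year stands on its own as a separate word; it is not the only fix and it still cannot tell a year from any other four-digit number.

```python
import re

sample = 'order 123456 shipped in May 2022'  # hypothetical test data

# The plain pattern grabs the first four digits of the order number.
first = re.search('[0-9][0-9][0-9][0-9]', sample)
print(first.group())  # 1234

# \b requires a word boundary on each side, so only a standalone
# four-digit number can match. Note the raw string (r'...') for the backslashes.
bounded_match = re.search(r'\b[0-9][0-9][0-9][0-9]\b', sample)
print(bounded_match.group())  # 2022
```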