Parsing a formatted string

kvvikram · March 3, 2021, 3:02pm

I am very new to Python, so this question might sound naive.
I tried to use a regex, but that did not help.
I have the string “PT2H21M12.666667S” and need to extract the number for each field,
i.e. 2, 21 and 12.666667 separately.

Can you please help me with the correct way to extract the values?

Thanking you in advance

pepoluan · March 4, 2021, 11:30am

What regex have you tried?

kvvikram · March 4, 2021, 1:22pm

import re

first_period_start = "PT2H21M12.666667S"
print(first_period_start)

start_time = first_period_start[2:]
print(start_time)
# Works up to here. Output is "2H21M12.666667S"

pattern = re.compile('HMS')
result = pattern.findall(start_time)
print(result)  # prints [] (this is the part that does not work)

pepoluan · March 5, 2021, 7:24am

The regex pattern "HMS" matches exactly the string "HMS". It will not match "H21M12S" or anything else that has additional characters between “H”, “M”, and “S”.

You need to use capturing groups, wildcards, and quantifiers.
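
To make the difference concrete, here is a minimal sketch comparing the literal pattern with one that uses capturing groups and quantifiers. The second pattern is only one possible choice, not necessarily the one intended above; it assumes each number is followed directly by its unit letter.

```python
import re

start_time = "2H21M12.666667S"

# A literal pattern only matches the exact text "HMS", which never occurs here.
literal_hits = re.findall("HMS", start_time)
print(literal_hits)  # []

# One capturing group for the number, one for the unit letter.
# \d+ means "one or more digits"; (?:\.\d+)? is an optional fractional part.
pairs = re.findall(r"(\d+(?:\.\d+)?)([HMS])", start_time)
print(pairs)  # [('2', 'H'), ('21', 'M'), ('12.666667', 'S')]
```

With more than one capturing group, findall() returns one tuple per match, so each number comes back paired with its unit.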

kvvikram · March 5, 2021, 8:06am

I have been able to solve it in a different way: checking for the ‘H’, ‘M’, ‘S’ and ‘.’ characters and then splitting the string on each of them.

Please have a look at the below code snippet.

# Assumes time_str holds the input, e.g. "PT2H21M12.666667S"
start_time = time_str[2:]

if "H" in start_time:
    hr_val = start_time.split("H", 1)
    hrs = int(hr_val[0])
    min_str = hr_val[1]
else:
    hrs = 0
    min_str = start_time

if "M" in min_str:
    min_val = min_str.split("M", 1)
    mins = int(min_val[0])
    sec_str = min_val[1]
else:
    mins = 0
    sec_str = min_str

if "S" in sec_str:
    msec_str = sec_str.split("S", 1)[0]
    if "." in msec_str:
        msec_val = msec_str.split(".", 1)
        secs = int(msec_val[0])
        # Digits after the "." as an integer (666667 for the example above)
        msec = int(msec_val[1])
    else:
        secs = int(msec_str)
        msec = 0
else:
    secs = 0
    msec = 0

I was wondering if there would be a better way to solve the problem.

pepoluan · March 5, 2021, 8:20am

Regex with “named capturing groups”.

kvvikram · March 5, 2021, 8:44am

As I have already mentioned, I am new to Python programming.
Can you please share a sample using regex with “named capturing groups”?

cameron · March 5, 2021, 9:28am

This is actually pretty good. (People will recommend regexps, but they
can be cryptic and complex ones lend themselves to accidents by
mismatching or failing to match.)

The code above makes a number of assumptions:

That time_str starts with “PT”, and therefore that you can skip the
first 2 characters.
That the H, M and S parts occur in that order.

Both those things seem reasonable, and perhaps guaranteed by what is
supplying your string. But it is worth commenting this in the code, for
example:

# time_str always starts with "PT", skip that
start_time = time_str[2:]

You can streamline the if-test and split like this:

try:
    hr_part, after_hr = start_time.split("H", 1)
except ValueError:
    hrs = 0
else:
    hrs = int(hr_part)
    start_time = after_hr

This relies on the fact that the assignment raises a ValueError if there’s
no “H”: split() then returns a single-element list, which cannot be unpacked
into the two names hr_part and after_hr. So you can skip the explicit test
for “H” in start_time by catching the exception.
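
A small sketch of what happens when the separator is missing. The input value here is made up for illustration; the try/except block is the same as above.

```python
start_time = "21M12.666667S"  # illustrative input with no "H" part

# split() itself does not fail: it returns a one-element list.
pieces = start_time.split("H", 1)
print(pieces)  # ['21M12.666667S']

# Unpacking that single element into two names is what raises ValueError.
try:
    hr_part, after_hr = start_time.split("H", 1)
except ValueError:
    hrs = 0
else:
    hrs = int(hr_part)
    start_time = after_hr

print(hrs, start_time)  # 0 21M12.666667S
```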

The “test first” with the if-statement is sometimes called “look before
you leap”. The try/except form, where you attempt the function and catch
a failure is sometimes called “ask for forgiveness” - try it and recover
if it isn’t allowed.

The other thing to notice is that I put the tail of the string after “H”
back into start_time. That way your next step also works on start_time,
and that makes the code more regular and easier to understand.
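
Because every step now works on start_time, the same try/except can be repeated for “H”, “M” and “S” in a loop. The following is only a sketch under the assumptions listed earlier (leading “PT”, fields in H, M, S order). The function name parse_duration is made up for this example, and unlike the original snippet it returns the seconds as a single float instead of separate secs and msec values.

```python
def parse_duration(time_str):
    """Return (hours, minutes, seconds) from a string like "PT2H21M12.666667S".

    Assumes the string starts with "PT" and that the H, M and S parts,
    when present, appear in that order.
    """
    # time_str always starts with "PT", skip that
    start_time = time_str[2:]
    values = []
    for unit in "HMS":
        try:
            part, after = start_time.split(unit, 1)
        except ValueError:
            # This unit is absent: treat the field as zero.
            values.append(0.0)
        else:
            values.append(float(part))
            start_time = after
    hrs, mins, secs = values
    return int(hrs), int(mins), secs


print(parse_duration("PT2H21M12.666667S"))  # (2, 21, 12.666667)
print(parse_duration("PT45S"))              # (0, 0, 45.0)
```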

Cheers,
Cameron Simpson cs@cskk.id.au

cameron · March 5, 2021, 9:47am

Pandu is probably seeking to avoid providing the code directly to you,
as writing the code yourself teaches better.

That said, the documentation for Python regular expressions is here:

https://docs.python.org/3/library/re.html#module-re

The “Regular Expression Syntax” part, about 4 paragraphs down, describes
the syntax. The basic deal is as follows:

Aside from some punctuation, most ordinary text matches itself. So a
regexp for “PT” is just “PT”.

The punctuation characters, referred to as “special characters” in the
documentation above, provide the “pattern” part of regular expressions.

Some common items include:

“*” Zero or more of the preceding item.

“+” One or more of the preceding item.

“?” Zero or one of the preceding item, i.e. an optional item.

Generally the preceding item is a single character, or some other
construct. So this:

AH?

means exactly one “A” followed by an optional “H”. When you need to
apply these to larger things you can group text like this:

(AH)?

That means zero or one of “AH”, instead of just the “H”.
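
A quick way to check the difference between the two patterns is re.fullmatch(), which succeeds only if the whole string matches. The test strings below are chosen purely for illustration.

```python
import re

# "AH?" : exactly one "A", then an optional "H".
print(bool(re.fullmatch("AH?", "A")))     # True
print(bool(re.fullmatch("AH?", "AH")))    # True
print(bool(re.fullmatch("AH?", "")))      # False, the "A" is required

# "(AH)?" : the whole "AH" is optional.
print(bool(re.fullmatch("(AH)?", "")))    # True
print(bool(re.fullmatch("(AH)?", "AH")))  # True
print(bool(re.fullmatch("(AH)?", "A")))   # False, "A" alone is not "AH"
```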

The re module remembers the locations of these groups when it matches a
string, so you can for example refer to the third group.

For added convenience it is possible to give groups names, like this:

(?P<somename>AH)?

That lets you refer to the group by the name “somename” after a match.
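
Applied to the string from the original question, named groups might look like the sketch below. This is one possible pattern, not the only one. The group names hours, minutes and seconds and the name DURATION_RE are made up for this example, and it also uses (?:...), a group that is matched but not remembered, which is not described above.

```python
import re

# Each field is optional, and each number is captured in a named group.
# (?:...) groups text without remembering it as a numbered or named group.
DURATION_RE = re.compile(
    r"PT"
    r"(?:(?P<hours>\d+)H)?"
    r"(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?"
)

m = DURATION_RE.fullmatch("PT2H21M12.666667S")
if m:
    # A group that took no part in the match yields None, hence "or 0".
    hrs = int(m.group("hours") or 0)
    mins = int(m.group("minutes") or 0)
    secs = float(m.group("seconds") or 0)
    print(hrs, mins, secs)  # 2 21 12.666667
```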

In addition to the module documentation above there’s a HOWTO here:

https://docs.python.org/3/howto/regex.html

That should get you started.

Cheers,
Cameron Simpson cs@cskk.id.au

kvvikram · March 5, 2021, 12:09pm

Well, this is so clean and clear. Implemented, and it works well.
I will look into the documentation you have provided for regex and see if I am able to achieve the same.

Thanks a ton