I have a variable named months to which I have stored a string with all
the months and their corresponding days. It goes something like this
months=''' January (31 days)
February (28 days)…'''
etc.
Please post code (and data) between “code fences”, lines of triple
backticks. They preserve indentation and punctuation. Example:
```
your code
goes here
```
There’s a </> button in the compose window to make one of these.
Now, using list comprehension, I’m trying to create a new list which will contain the numbers of the months list. I wrote this:
month_days=[int(x.split("(")[1]) for x in months.split("\n")]
What I don’t understand is why I get an error saying:
ValueError: invalid literal for int() with base 10: '31 days)'
It only works if I add [:2] in the code like this:
month_days=[int(x.split("(")[1][:2]) for x in months.split("\n")]
Basically, `split` is not a great tool for what you’re trying to do. Let’s
look in detail.
Your `months` seems to be a multiline string with leading and trailing
whitespace on the lines, eg:
January (31 days)
February (28 days)
The `split` method breaks a string on the supplied separator. So you
start with `months.split("\n")`, which gets you the lines above as
separate strings for use in your list comprehension.
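As a quick illustration of that first step, here is a small sketch. The sample string is an assumption (just the first two months, with the leading spaces guessed from your post); your real `months` will be longer.

```python
# Assumed sample data: two lines shaped like the ones in the question.
months = " January (31 days)\n February (28 days)"

lines = months.split("\n")
print(lines)   # [' January (31 days)', ' February (28 days)']

# split("\n") keeps empty strings, so a string that starts or ends
# with a newline produces empty entries in the result.
edge = "\nJanuary (31 days)\n".split("\n")
print(edge)    # ['', 'January (31 days)', '']
```

The second case matters if your triple-quoted string begins or ends with a newline: those empty strings would also be fed to the list comprehension.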
Considering the string `" January (31 days)"`, your next expression is
this:
int(x.split("(")[1])
where `x` holds the string. In pieces:
`x.split("(")`: there’s just one opening bracket in the string, so you
get a list containing 2 strings: `[" January ", "31 days)"]`
`[1]`: this selects the second string: `"31 days)"`.
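You can check those two pieces at the interactive prompt. A minimal sketch, assuming one line of the sample data:

```python
x = " January (31 days)"   # one line, as assumed from the question

parts = x.split("(")
print(parts)       # [' January ', '31 days)']

second = parts[1]
print(second)      # 31 days)
```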
The `int()` function expects a bare decimal integer (or a couple of
other forms, but they are all “bare” integers with no additional junk).
By “additional junk” I mean, for this string, the text `" days)"`.
See: Built-in Functions — Python 3.11.2 documentation
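To see what `int()` will and won’t accept, here is a short sketch using the string from your error message. It also shows why your `[:2]` version happens to work.

```python
print(int("31"))       # 31
print(int("  31 "))    # 31: surrounding whitespace is tolerated

try:
    int("31 days)")
except ValueError as exc:
    message = str(exc)
    print("ValueError:", message)

# Slicing with [:2] keeps only the first two characters, "31",
# which is why that version runs without an error.
sliced = "31 days)"[:2]
print(int(sliced))     # 31
```

Note that the slice is fragile: a number with three or more digits would be silently cut down to its first two.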
The simplest incremental change at this point is to notice that the
number part is separated from the junk by a space, and to split it
again.
Your current string is made by `x.split("(")[1]`. If we call `split()`
again we get a list of 2 strings: `["31", "days)"]`. So this:
x.split("(")[1].split()[0]
would do an additional `split()` and then select the first string
from that result, which should be `"31"`. And that is a valid string
for use with `int()`.
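Dropped back into your list comprehension, that looks like the sketch below. The two-line sample string is again an assumption standing in for your full `months`.

```python
# Assumed sample data, shaped like the lines in the question.
months = " January (31 days)\n February (28 days)"

month_days = [int(x.split("(")[1].split()[0]) for x in months.split("\n")]
print(month_days)   # [31, 28]
```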
You can see that this expression is getting pretty horrible. You can
tease it apart, but what it’s doing is obscure. You might be better
building the list incrementally. Untested example:
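days = []
for x in months.split("\n"):
    # split the line into the month name and everything after the "("
    month, day_count_etc = x.split("(")
    # the first whitespace-separated word of the remainder is the number
    day = day_count_etc.split()[0]
    days.append(int(day))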
I’m usually loath to recommend this (because regexps are very
overused) but a regular expression is the common way to parse things out
of simple regular text.
I want to reinforce that regexps are intended for recognising simple
pieces of regularly formed text, and have their own syntax for this
which avoids complicated stuff like your `.split()` calls above. But
regexps are also cryptic and it is easy to make mistakes in complicated
expressions. Anyway, …
Here’s a regexp which matches your examples:
\w+ \(\d+ days\)
You can see that this is already cryptic. However, it also has the fixed
bits of the text, eg `days`, in the clear right in front of you.
The above matches: a “word”, a space, an opening bracket, some digits, a
space, the text “`days`” and a closing bracket. You’re interested in the
digits, so we’ll embellish a bit:
\w+ \((\d+) days\)
which introduces a “group” (the brackets around the `\d+` part) so that
we can access it later.
Taking this expression in pieces we’ve got:
- `\w` means a “word character”: a letter, a digit or an underscore
- `\w+` means 1 or more word characters, thus a “word”
- a space is a space
- `\(` means an open bracket. In regexps the brackets indicate grouping, so we’re “escaping” our bracket with a backslash to indicate that we really just mean a bracket character.
- `(` this unescaped bracket starts a “group”, a section of the expression we want to remember for use later; here we’re going to remember the number of days
- `\d` means a digit
- `\d+` means one or more digits, thus a “number”
- `)` a closing bracket, ending the “group”; this means we will remember the “number” part of the expression
- a space is a space
- `days` is also just the text `days`
- `\)` an escaped closing bracket, because we mean just a bracket character
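If you want to convince yourself of what each piece does, you can try the pieces on their own. This is only a sketch: `re.fullmatch` (which requires the whole string to match) is my choice for the experiment, and the test strings are made up. Note the lowercase `w`: uppercase `\W` is the opposite class, any character that is not a word character.

```python
import re

# Each piece of the pattern, tried against a small made-up string.
word_ok = bool(re.fullmatch(r'\w+', "January"))      # letters only
digits_ok = bool(re.fullmatch(r'\d+', "31"))         # digits only
bracket_ok = bool(re.fullmatch(r'\(', "("))          # a literal bracket
word_bad = bool(re.fullmatch(r'\w+', "31 days"))     # the space is not a word character
print(word_ok, digits_ok, bracket_ok, word_bad)      # True True True False

# The whole pattern, with the group around the digits.
m = re.fullmatch(r'\w+ \((\d+) days\)', "January (31 days)")
print(m.group(1))   # 31
```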
In Python code (again, completely untested):
import re  # <- at the top of your script

match_day = r'\w+ \((\d+) days\)'
match_day_re = re.compile(match_day)
days = []
for x in months.split("\n"):
    m = match_day_re.search(x)
    if not m:
        print("no match on text:", x)
    else:
        # group(1) is the bracketed (\d+) part; group(0) is the whole match
        n = int(m.group(1))
        days.append(n)
There are many things to remark on in this code:
We import the `re` module at the start of our script; `import`
statements generally go at the top of scripts. This gets us access to
Python’s regular expression module.
import re # <- at the top of your script
Then we define the regular expression we’ll be using to match the text:
match_day = r'\w+ \((\d+) days\)'
match_day_re = re.compile(match_day)
To define `match_day` we’re using a “raw string”, which starts with
`r'`. This means that backslashes inside the string are not special.
You’ll remember splitting your string up on newlines with `split("\n")`.
That splits on a string containing a single newline character, which is
_represented_ in Python as the 2 characters `\` and `n`, which Python
sees and recognises as indicating a newline character.
Regexps also use backslashes heavily, so to avoid mixing them up we
use Python’s “raw string” syntax which keeps those characters as is.
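The difference is easy to see by counting characters. A tiny sketch:

```python
normal = "\n"    # Python turns backslash + n into one newline character
raw = r"\n"      # raw string: the backslash and the n are both kept

print(len(normal))   # 1
print(len(raw))      # 2

# A raw string is the same as doubling every backslash by hand.
print(r"\d+" == "\\d+")   # True
```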
Inside the loop we look for our regexp:
    m = match_day_re.search(x)
This searches the string in `x` for our pattern and returns a regexp
`Match` object or `None`, the common Python placeholder for “no value”.
    if not m:
        print("no match on text:", x)
If the pattern wasn’t seen, we take this branch to complain.
    else:
        n = int(m.group(1))
        days.append(n)
Otherwise we access the first group as `m.group(1)`, which should
contain the number of days (`m.group(0)` would be the entire matched
text). That is a string, and we still need to convert that string to
an `int`, which we do.
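Here is a small sketch of what the `Match` object gives you, using one assumed sample line and an empty string to stand in for a line that doesn’t fit the pattern:

```python
import re

match_day_re = re.compile(r'\w+ \((\d+) days\)')

m = match_day_re.search(" January (31 days)")
whole = m.group(0)     # the entire matched text
number = m.group(1)    # only the part inside the first unescaped brackets
print(repr(whole))     # 'January (31 days)'
print(repr(number))    # '31'
print(int(number))     # 31

# A line that does not fit the pattern gives None, not a Match object.
missing = match_day_re.search("")
print(missing)         # None
```

Groups are numbered from 1 in the order their opening brackets appear; group 0 is always the whole match, which is why passing it to `int()` would fail here.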
References:
The `int` function: Built-in Functions — Python 3.11.2 documentation
The `re` module: re — Regular expression operations — Python 3.11.2 documentation
Cheers,
Cameron Simpson cs@cskk.id.au