Issue with converting to integer

Irons1989 · February 13, 2023, 8:07pm

Hello everyone,

I’m totally new to Python; I’ve been taking lessons for the last two weeks, so please excuse this silly question.

Here’s my issue:

I have a variable named months in which I have stored a string with all the months and their corresponding days. It goes something like this: months=''' January (31 days)
February (28 days)…'''
etc.

Now, using list comprehension, I’m trying to create a new list which will contain the numbers of the months list. I wrote this:

month_days=[int(x.split("(")[1]) for x in months.split("\n")]

What I don’t understand is why I get an error saying:

ValueError: invalid literal for int() with base 10: '31 days)'

It only works if I add [:2] in the code like this:

month_days=[int(x.split("(")[1][:2]) for x in months.split("\n")]

Can someone explain simply why this happens?

bryevdv · February 13, 2023, 9:05pm

The best way to figure things out in situations like this is to take the individual pieces apart and look at the actual data at each step:

In [1]: months="""\
   ...: January (31 days)
   ...: February (28 days)
   ...: """

In [3]: months.split("\n")
Out[3]: ['January (31 days)', 'February (28 days)', '']

In [4]: x = 'January (31 days)' 

In [5]: x.split("(")[1]
Out[5]: '31 days)'

So your code ends up calling int('31 days)'), which fails because the string '31 days)' is not (only) the string representation of an integer. The int function does not know what to do with the part after the characters “31”, so it raises an error.
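You can confirm this for yourself by calling int on both strings. This is only a sketch, and the variable names are made up for illustration:

```python
# Hypothetical names, for illustration only.
good = "31"
bad = "31 days)"

print(int(good))  # works: the string holds only digits

try:
    int(bad)
except ValueError as err:
    # int() rejects the whole string rather than reading a leading number
    message = str(err)
    print(message)
```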

So what is the value when you include [:2]? Easy enough to check in the REPL:

In [7]: x.split("(")[1][:2]
Out[7]: '31'

That string is actually a string representation of an integer, so calling int on it can succeed.

All that said, just putting [:2] in the code there is maybe not the most robust solution, although I guess technically it will work in the single-digit-followed-by-a-space case, since int will ignore whitespace:

In [10]: int("3 ")
Out[10]: 3
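To illustrate why the fixed-width slice is fragile, here is a small sketch. The helper name days_by_slice and the "Example" lines are made up, not part of the original data. The slice happens to work for one or two digits but silently gives the wrong answer for a longer number:

```python
def days_by_slice(line):
    # Take the text after "(" and keep only its first two characters.
    return int(line.split("(")[1][:2])

print(days_by_slice("February (28 days)"))  # 28
print(days_by_slice("Example (9 days)"))    # 9, because int("9 ") ignores the space
print(days_by_slice("Example (100 days)"))  # 10, silently wrong
```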

rob42 · February 13, 2023, 9:08pm

First: there are no silly questions when one is new to a subject, so don’t go there; it’s fine.

If you add the code line print(month_days) you’ll be able to see that list object and how it is indexed. You understand that Python uses zero indexing, right?

To add: there are better ways to do what you are trying to do, such as using a dictionary object:

month_days = {
    "January": 31,
    "February": 28,
    "March": 31,
    # and the rest 
    }

… but I’m sure you can see what would happen given a leap year. Even that shortcoming can be overcome with some coding. There’s an even better way by coding some functions, but that’s a ways down the road for you right now.
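For the leap-year point, one possible approach (not necessarily the one meant above) is to let the standard library's calendar module work out the day counts for a given year instead of hard-coding them. A minimal sketch; the function name days_in_months is made up for illustration:

```python
import calendar

def days_in_months(year):
    # calendar.monthrange(year, month) returns a pair:
    # (weekday of the first day, number of days in the month).
    # calendar.month_name gives the month names (English by default).
    return {
        calendar.month_name[month]: calendar.monthrange(year, month)[1]
        for month in range(1, 13)
    }

print(days_in_months(2023)[calendar.month_name[2]])  # 28
print(days_in_months(2024)[calendar.month_name[2]])  # 29
```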

cameron · February 13, 2023, 9:54pm

I have a variable named months to which I have stored a string with all
the months and their corresponding days. It goes something like this
months=''' January (31 days)
February (28 days)…'''
etc.

Please post code (and data) between “code fences”, lines of triple
backticks. They preserve indentation and punctuation. Example:

 ```
 your code
 goes here
 ```

There’s a </> button in the compose window to make one of these.

Now, using list comprehension, I’m trying to create a new list which will contain the numbers of the months list. I wrote this:
month_days=[int(x.split("(")[1]) for x in months.split("\n")]
What I don’t understand is why I get an error saying:
ValueError: invalid literal for int() with base 10: '31 days)'
It only works if I add [:2] in the code like this:
month_days=[int(x.split("(")[1][:2]) for x in months.split("\n")]

Basically, split is not a great tool for what you’re trying to do. Let’s
look in detail.

Your months seems to be a multiline string with leading and trailing
whitespace on the lines, eg:

 January (31 days)
     February (28 days)

The split method breaks a string at each occurrence of the supplied
separator. So you start with months.split("\n"), which gets you the
lines above as separate strings for use in your list comprehension.

Considering the string " January (31 days)", your next expression is
this:

 int(x.split("(")[1])

where x holds the string. In pieces:

x.split("("): there’s just one opening bracket in the string, so you
get a list containing 2 strings: ["January ", "31 days)"]

[1]: this selects the second string: "31 days)".

The int() function expects a bare decimal integer (or a couple of
other forms, but they are all “bare” integers with no additional junk).
By “additional junk” I mean, for this string, the text " days)".

See: Built-in Functions — Python 3.12.1 documentation

The simplest incremental change at this point is to notice that the
number part is separated from the junk by a space, and to split it
again.

Your current string is made by x.split("(")[1]. If we call split()
again we get a list of 2 strings: ["31","days)"]. So this:

 x.split("(")[1].split()[0]

would do an additional split() and then select the first string
from that result, which should be "31". And that is a valid string
for use with int().

You can see that this expression is getting pretty horrible. You can
tease it apart, but what it’s doing is obscure. You might be better
building the list incrementally. Untested example:

 days = []
 for x in months.split("\n"):
     month, day_count_etc = x.split("(")
     day = day_count_etc.split()[0]
     days.append(int(day))
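A runnable variation on the loop above, as a sketch. It assumes a two-line sample string like the one in the question, and it skips lines without a "(", since splitting on "\n" can leave an empty string at the end that cannot be unpacked into two parts:

```python
months = """\
January (31 days)
February (28 days)
"""

days = []
for x in months.split("\n"):
    if "(" not in x:
        # e.g. the empty string left after the final newline
        continue
    month, day_count_etc = x.split("(")
    day = day_count_etc.split()[0]
    days.append(int(day))

print(days)  # [31, 28]
```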

I’m usually loath to recommend this (because regexps are very
overused) but a regular expression is the common way to parse things out
of simple regular text.

I want to reinforce that regexps are intended for recognising simple
pieces of regularly formed text, and have their own syntax for this
which avoids complicated stuff like your .split() calls above. But
regexps are also cryptic and it is easy to make mistakes in complicated
expressions. Anyway, …

Here’s a regexp which matches your examples:

 \W+ \(\d+ days\)

You can see that this is already cryptic. However, it also has the fixed
bits of the text eg days in the clear right in front of you.

The above matches: a “word”, a space, an opening bracket, some digits, a
space, the text “days” and a closing bracket. You’re interested in the
digits, so we’ll embellish a bit:

 \W+ \((\d+) days\)

which introduces a “group” (the brackets around the \d+ part) so that
we can access it later.

Taking this expression in pieces we’ve got:

\W means a "word character, a letter or a digit
\W+ means 1 or more word characters, thus a “word”
a space is a space
\( means an open bracket. In regexps the brackets indicate grouping, so we’re “escaping” our bracket with a backslash to indicate that we really just mean a bracket character.
( this unescaped bracket starts a “group”, a section of the expression we want to remember for use later; here we’re going to remember the number of days
\d means a digit
\d+ means one or more digits, thus a “number”
) a closing bracket, ending the “group”; this means we will remember the “number” part of the expression
a space is a space
days is also just the text days
\) an escaped closing bracket, because we mean just a bracket character

In Python code (again, completely untested):

 import re  # <- at the top of your script

 match_day = r'\W+ \((\d+) days\)'
 match_day_re = re.compile(match_day)

 days = []
 for x in months.split("\n"):
     m = match_day_re.search(x)
     if not m:
         print("no match on text:", x)
     else:
         n = int(m.group(1))
         days.append(n)

There are many things to remark on in this code:

We import the re module at the start of our script; import
statements generally go at the top of scripts. This gets us access to
Python’s regular expression module.

 import re  # <- at the top of your script

Then we define the regular expression we’ll be using to match the text:

 match_day = r'\W+ \((\d+) days\)'
 match_day_re = re.compile(match_day)

To define match_day we’re using a “raw string”, which starts with
r'. This means that backslashes inside the string are not special.

You’ll remember splitting your string up on newlines with split("\n").
That splits on a string containing a single newline character, which is
represented in Python as the 2 characters \ and n, which Python
sees and recognises as indicating a newline character.

Regexps also use backslashes heavily, so to avoid mixing them up we
use Python’s “raw string” syntax which keeps those characters as is.
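The difference between an ordinary string and a raw string is easy to see by checking lengths. A small sketch, with variable names made up for illustration:

```python
newline = "\n"  # ordinary string: backslash-n becomes one newline character
raw = r"\n"     # raw string: a backslash followed by the letter n

print(len(newline))  # 1
print(len(raw))      # 2
print(raw == "\\n")  # True: the same two characters, written with an escaped backslash
```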

Inside the loop we look for our regexp:

 m = match_day_re.search(x)

This searches the string in x for our pattern and returns a regexp
Match object or None, the common Python placeholder for “no value”.

 if not m:
     print("no match on text:", x)

If the pattern wasn’t seen, we take this branch to complain.

 else:
     n = int(m.group(1))
     days.append(n)

Otherwise we access the first group as m.group(1), which should
contain the number of days (m.group(0) would be the whole matched
text). That is a string, and we still need to convert that string to
an int, which we do.
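For reference, on a Match object group(0) is the entire matched text and group(1) is the first bracketed group. A sketch on one sample line; note that it assumes a lowercase \w for the word part of the pattern, per the correction further down this thread:

```python
import re

m = re.search(r'\w+ \((\d+) days\)', "January (31 days)")

print(m.group(0))       # January (31 days)  <- the whole match
print(m.group(1))       # 31                 <- the first group, still a string
print(int(m.group(1)))  # 31
```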

References:
The int function: Built-in Functions — Python 3.12.1 documentation
The re module: re — Regular expression operations — Python 3.12.1 documentation

Cheers,
Cameron Simpson cs@cskk.id.au

MRAB · February 13, 2023, 10:43pm

FYI, \w (lowercase) matches a word character, \W (uppercase) matches a non-word character. Compare with \d (lowercase) that matches a digit and \D (uppercase) that matches a non-digit.
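The difference is easy to check in code. A sketch on the sample line from earlier in the thread (the variable names are made up): with the lowercase \w+ the pattern matches, and with the uppercase \W+ it does not, because the character just before " (" is a letter.

```python
import re

line = "January (31 days)"

lower = re.search(r'\w+ \((\d+) days\)', line)  # \w+ : one or more word characters
upper = re.search(r'\W+ \((\d+) days\)', line)  # \W+ : one or more NON-word characters

print(lower.group(1))  # 31
print(upper)           # None
```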

cameron · February 13, 2023, 11:46pm

Gah! Yes of course. Sorry for the error. - Cameron Simpson cs@cskk.id.au

Irons1989 · February 14, 2023, 3:34pm

Thank you very much for the time you took to reply and for the detailed answer. I wrote the code that way because the teacher in the lessons I’ve been taking wrote it that way, so I was trying to understand it as is.

I get it now, thank you all very much!