
Why is Python (using JupyterLab) giving me “rofile” instead of “Profile”?
I’ve run it a few times, and I get the same result each time.

Clearly because start_index is one further along than you expected. You have given no context about what the rest of the code does or why you expect Profile, so I can’t be any more helpful.
In [1]: #Amos, D. 2012-2024. A Practical Introduction to Web Scraping in Python.
In [2]: #The urllib file contains tools for working with URLs, and the urllib.req
In [3]: #Part 1: Accessing and printing a page's HTML via the urlopen script.
In [4]: #the script, "from urllib.request import urlopen" imports urlopen()
In [5]: from urllib.request import urlopen
In [6]: #Here, name the variable, "url". The variable will be the web page to be
In [7]: url="http://olympus.realpython.org/profiles/aphrodite"
In [8]: #Once named, the variable, "url", Python must pass it to "urlopen()", and
In [9]: page = urlopen(url)
In [10]: #The HTTPResponse object's .read() method extracts HTML from the page and
In [11]: html_bytes=page.read()
In [12]: #Using .decode() decodes the bytes to a string, using UTF-8.
In [13]: html=html_bytes.decode("utf-8")
In [14]: # Using print() prints the HTML page contents to screen.
In [15]: print(html)
<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>
In [16]: #Part 2: Extract text from HTML with string methods.
In [17]: #String methods extract information from a web page's HTML.
In [18]: #It's first necessary to know the index of the first character of the tit
In [20]: title_index=html.find('title')
In [21]: title_index
Out[21]:
15
In [22]: #the string, .find searches through HTML text for <title> tags and extrac
In [23]: start_index=title_index+len("<title>")
start_index
Out[23]:
22
In [ ]: #Next, get the index of the closing </title> tag by passing the string "<
In [24]: end_index=html.find("</title>")
end_index
Out[24]:
39
In [25]: #Finally, extract the title by slicing the html string:
In [27]: title=html[start_index:end_index]
title
Out[27]:
'rofile: Aphrodite'
You search for the string "title", but then you add the length of "<title>", which is two characters longer. Also, the index where "title" starts is already past the opening "<", so start_index ends up one character into the title text (22 instead of 21).
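A minimal sketch of the fix, assuming the page begins exactly as in the printed output above. The html string here is a shortened stand-in so the snippet runs without a network request, and the tag variable is only an illustrative name. The point is to search for the same string whose length you add afterwards:

```python
# Shortened stand-in for the downloaded page (same first three lines as the output above).
html = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n</html>"

tag = "<title>"
# Search for the full opening tag, not just the word "title".
tag_index = html.find(tag)            # index of the "<" that opens the tag
start_index = tag_index + len(tag)    # first character after "<title>"
end_index = html.find("</title>")     # index of the "<" that opens the closing tag

title = html[start_index:end_index]
print(title)  # Profile: Aphrodite
```

With the same string used for both the search and the length, start_index comes out as 21 rather than 22, so the slice keeps the leading "P".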
Searching for “title” may not be a great idea in the first place, as it will find the first occurrence of that string, which may not be the HTML title tag. It also won’t find it if the tag is in uppercase, for instance. You’re likely better off using an HTML parser library if you want to extract structural bits of HTML.
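To illustrate both pitfalls, here is a small sketch on a made-up page (the META tag and the uppercase tags are assumptions for the demo, not what the Aphrodite page contains). Searching a lowercased copy is one possible workaround that keeps the indices valid for the original string, as long as lowercasing does not change the string's length, which holds for ASCII text:

```python
# Hypothetical page: "title" appears in an attribute before the real, uppercase tag.
html = '<HTML><HEAD><META NAME="title" CONTENT="x"><TITLE>Profile: Aphrodite</TITLE></HEAD></HTML>'

first_hit = html.find("title")      # lands inside the META attribute, not on the tag
tag_hit = html.find("<title>")      # -1, because find() is case-sensitive
print(first_hit, tag_hit)

# Search a lowercased copy, then slice the original string with those indices.
lowered = html.lower()
start_index = lowered.find("<title>") + len("<title>")
end_index = lowered.find("</title>")
print(html[start_index:end_index])
```

This still breaks on tags with attributes or extra whitespace, which is why a parser is the more robust choice.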
Thank you, Brendan, for the thoughtful reply. I will modify the string, as you suggested.
I agree that using an HTML parser library would be a better solution for extracting information, but I’m working with a very limited skill set, just trying to follow the instructions in a tutorial. Hopefully, I’ll be able to do that sort of thing soon.
Thanks again.
Beautiful Soup is a well-known package for handling HTML, and there are lots of tutorials for it out there.
It’s not part of the Python standard library, but the standard library does have html.parser.HTMLParser, which is intuitive and powerful. Have a look at the example in its documentation; it probably needs only slight modification for your purpose here.
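For reference, a rough sketch of what that could look like with the standard library's HTMLParser. The TitleParser class and its attribute names are made up for this example, and the html string is a shortened stand-in for the downloaded page (with an uppercase tag to show that case no longer matters):

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase, so <TITLE> arrives as "title".
        if tag == "title" and self.title is None:
            self.in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between the opening and closing title tags.
        if self.in_title:
            self.title += data


html = "<html>\n<head>\n<TITLE>Profile: Aphrodite</TITLE>\n</head>\n</html>"
parser = TitleParser()
parser.feed(html)
parser.close()
print(parser.title)  # Profile: Aphrodite
```

In the notebook, the decoded html string from the earlier cells could be passed to parser.feed() in place of the stand-in.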
Thank you, Will. I will take that advice to heart!