Why "rofile," instead of "Profile"

HAL9000 · July 13, 2024, 3:10pm

[Image: Screenshot from 2024-07-13 09-07-07]

Why is Python (using JupyterLab) giving me “rofile” instead of “Profile”?

I’ve run it a few times, and I get the same result each time.

MegaIng · July 13, 2024, 3:17pm

Clearly because start_index is one further along than you expected. You are giving zero context about what the rest of the code is doing or why you expect Profile, so I can’t be any more helpful.

HAL9000 · July 13, 2024, 8:23pm

In [1]: #Amos, D. 2012-2024. A Practical Introduction to Web Scraping in Python.
In [2]: #The urllib file contains tools for working with URLs, and the urllib.req
In [3]: #Part 1: Accessing and printing a page's HTML via the urlopen script.
In [4]: #the script, "from urllib.request import urlopen" imports urlopen()
In [5]: from urllib.request import urlopen
In [6]: #Here, name the variable, "url". The variable will be the web page to be
In [7]: url="http://olympus.realpython.org/profiles/aphrodite"
In [8]: #Once named, the variable, "url", Python must pass it to "urlopen()", and
In [9]: page = urlopen(url)
In [10]: #The HTTPResponse object's .read() method extracts HTML from the page and
In [11]: html_bytes=page.read()
In [12]: #Using .decode() decodes the bytes to a string, using UTF-8.
In [13]: html=html_bytes.decode("utf-8")
In [14]: # Using print() prints the HTML page contents to screen.
In [15]: print(html)
<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>
In [ ]:
In [16]: #Part 2: Extract text from HTML with string methods.
In [17]: #String methods extract information from a web page's HTML.
In [18]: #It's first necessary to know the index of the first character of the tit
In [20]: title_index=html.find('title')
In [21]: title_index
Out[21]:
15
In [22]: #the string, .find searches through HTML text for <title> tags and extrac
In [23]: start_index=title_index+len("<title>")
start_index
Out[23]:
22
In [ ]: #Next, get the index of the closing </title> tag by passing the string "<
In [24]: end_index=html.find("</title>")
end_index
Out[24]:
39
In [25]: #Finally, extract the title by slicing the html string:
In [27]: title=html[start_index:end_index]
title
Out[27]:
'rofile: Aphrodite'

BrenBarn · July 13, 2024, 8:31pm

You search for the string "title", but then you add the length of "<title>", which is one character longer. The index where "title" starts is already past the opening <, so adding the full tag length lands one character too far: the slice starts at "r" instead of "P".
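
A minimal sketch of that arithmetic. The `html` literal below is an assumption: a shortened stand-in for the downloaded page that keeps the same first three lines, so the indices match the ones in the post above.

```python
# Shortened stand-in for the downloaded page (assumption: same first three lines).
html = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n</html>"

# Original approach: find("title") lands on the "t", one past the "<".
title_index = html.find("title")              # 15
bad_start = title_index + len("<title>")      # 15 + 7 = 22, one character too far

# Searching for the full tag makes the search string and the added length agree.
tag = "<title>"
start_index = html.find(tag) + len(tag)       # 14 + 7 = 21
end_index = html.find("</title>")             # 39

bad_title = html[bad_start:end_index]
title = html[start_index:end_index]
print(bad_title)  # rofile: Aphrodite
print(title)      # Profile: Aphrodite
```

Keeping the tag in one variable (`tag`, a name chosen here for illustration) means the string passed to `.find()` and the string passed to `len()` cannot drift apart.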

Searching for “title” may not be a great idea in the first place, as it will find the first occurrence of that string, which may not be the HTML title tag. It also won’t find it if the tag is in uppercase, for instance. You’re likely better off using an HTML parser library if you want to extract structural bits of HTML.
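
Both pitfalls can be seen in a short sketch. The two HTML snippets are hypothetical, made up for illustration rather than taken from the tutorial page.

```python
# Hypothetical page with an uppercase tag.
upper_html = "<HTML><HEAD><TITLE>Profile: Aphrodite</TITLE></HEAD></HTML>"
# str.find is case-sensitive, so the lowercase tag is not found.
print(upper_html.find("<title>"))  # -1

# Hypothetical page where the word "title" appears before the real tag.
tricky_html = (
    '<html><head><meta name="title" content="x">'
    "<title>Real</title></head></html>"
)
first_hit = tricky_html.find("title")   # matches inside the <meta> attribute
tag_hit = tricky_html.find("<title>")   # the actual title tag
print(first_hit < tag_hit)  # True
```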

HAL9000 · July 13, 2024, 9:13pm

Thank you, Brendan, for the thoughtful reply. I will modify the string, as you suggested.

I agree that using an HTML parser library would be a better solution for extracting information, but I’m working with a very limited skill set, just trying to follow the instructions in a tutorial. Hopefully, I’ll be able to do that sort of thing soon.

Thanks again.

will_f · July 14, 2024, 1:41am

Beautiful Soup is a well-known package for handling HTML, and there are lots of tutorials for it out there.

It’s not part of the Python standard library, but the standard library does have html.parser.HTMLParser, which is intuitive and powerful. Have a look at the example; it probably needs only slight modification for your purpose here.
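
As a rough sketch of what that could look like for the title case: the subclass below is not the example from the documentation. `TitleParser` and its attributes are names invented here, and the `html` literal is a shortened stand-in for the page, with the tag uppercased on purpose.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase, so <TITLE> also matches.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between the opening and closing tags.
        if self.in_title:
            self.title += data


# Shortened stand-in for the downloaded page (assumption), with an uppercase tag.
html = "<html>\n<head>\n<TITLE>Profile: Aphrodite</TITLE>\n</head>\n</html>"

parser = TitleParser()
parser.feed(html)
parser.close()
print(parser.title)  # Profile: Aphrodite
```

Because the parser works on tags rather than raw substrings, there is no index arithmetic to get wrong, and the case of the tag does not matter.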

HAL9000 · July 14, 2024, 11:19pm

Thank you, Will. I will take that advice to heart!