Hello,
I am a data analytics student and we’re looking at web crawlers/spiders. Our instructor showed us sample code to crawl wikipedia.com, but I get the error shown in the attached image:
Can anyone help me with this?
TIA,
Rich
It’s missing a closing quote.
Hi.
You’ve mixed tabs and spaces for indentation.
I did this here:
I configured my editor to show unprintable characters. Do you notice this?
You should review your last or previous line (maybe others too).
Where, exactly, please?
My editor (Jupyter) does not allow that.
Moving my cursor to every space/character, I don’t see any evidence of an unprintable character.
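If the editor can't display whitespace, Python itself can report it. The snippet below is only a sketch: describe_indentation and the sample string are made up for illustration, and you would paste the text of your own cell in place of sample.

```python
def describe_indentation(source):
    """Return (line_number, kind) for every indented, non-blank line.

    kind is "spaces", "tabs", or "mixed".
    """
    report = []
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.lstrip(" \t")
        indent = line[: len(line) - len(stripped)]
        if not indent or not stripped:
            continue  # skip unindented and blank lines
        if "\t" in indent and " " in indent:
            kind = "mixed"
        elif "\t" in indent:
            kind = "tabs"
        else:
            kind = "spaces"
        report.append((number, kind))
    return report


# Hypothetical cell contents: line 2 is indented with four spaces, line 3 with a tab.
sample = "for link in links:\n    print(link)\n\tprint('done')\n"
for number, kind in describe_indentation(sample):
    print(number, kind)
```

If the report shows both "spaces" and "tabs" (or any "mixed"), re-indent those lines so every line uses the same style. For a script saved as a .py file, the standard library's tabnanny module (python -m tabnanny yourfile.py) does a similar check.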
Try:
Actually, it’s missing more than just a closing quote.
Look at line 11:
listOfLinks.append("\"{0}\ ->\"{1}\";.format(masterLinkList)
I think that should be:
listOfLinks.append("\"{0}\" -> \"{1}\"").format(masterLinkList))
I think it should be:
listOfLinks.append("\"{0}\" -> \"{1}\"".format(masterLinkList))
Ah, correct!
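For reference, here is a small runnable sketch of that corrected line. The sample data is hypothetical, and I am assuming masterLinkList holds two items (a source and a target). Note that if it really is a single list, .format(masterLinkList) passes only one argument and {1} would raise an IndexError, so the sketch unpacks it with *.

```python
# Hypothetical sample data; the real masterLinkList comes from the instructor's code.
masterLinkList = ["Python", "Guido_van_Rossum"]
listOfLinks = []

# The closing quote ends the string literal before .format() is called on it.
# The * unpacks the two items so {0} and {1} each receive a value.
listOfLinks.append("\"{0}\" -> \"{1}\"".format(*masterLinkList))

# Same result with single quotes around the literal, which avoids the backslashes.
listOfLinks.append('"{0}" -> "{1}"'.format(*masterLinkList))

print(listOfLinks[0])
```

Both appends produce the same string; the single-quoted form is just easier to read and makes a missing quote easier to spot.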
Thanks everyone for your suggestions, but so far, nothing is working.
Can anyone explain how proper indentation works in Python?
Also, does every line with an if statement need to end with a colon (“:”)? Is that required with nested loops too?
Thank You!
You should not be crawling Wikipedia. See: Wikipedia:Database download - Wikipedia
Quote:
Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia.
You should also not be randomly crawling any internet sites unless you respect the site’s policies and robots.txt (see for instance: https://en.wikipedia.org/robots.txt and if you don’t know what that is, see: robots.txt - Wikipedia).
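As a rough illustration of what respecting robots.txt can look like in code, the standard library has urllib.robotparser. The rules, the example.com URLs, and the "MyClassBot" user agent below are all made up so the sketch runs without touching the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, inlined so nothing is downloaded.
sample_rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(sample_rules.splitlines())

# "MyClassBot" is a made-up user agent name for this example.
allowed_public = parser.can_fetch("MyClassBot", "https://example.com/index.html")
allowed_private = parser.can_fetch("MyClassBot", "https://example.com/private/data.html")
delay = parser.crawl_delay("MyClassBot")

print(allowed_public, allowed_private, delay)
```

Against a real site you would call parser.set_url(...) with the site's robots.txt address and then parser.read() instead of parse(), and skip any URL for which can_fetch() returns False. Passing robots.txt does not replace reading the site's terms of use.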
I understand what you’re saying, but this is a simple, classroom assignment. This is not a serious effort to crawl, but only to demonstrate the process to us students.
I suggest you read this: Design and History FAQ — Python 3.13.3 documentation
If you want to go deeper: 2. Lexical analysis — Python 3.13.3 documentation
Yes, it does. The ‘:’ says to Python “I’ll start a nested block”
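A small sketch of how the colon and the indentation work together. The data and variable names here are invented for the example.

```python
# Hypothetical data: each page name maps to the pages it links to.
pages = {"Python": ["Guido", "PEP8"], "Rust": []}
edges = []

for page, targets in pages.items():       # colon opens the for block
    if targets:                           # colon opens the if block, one level deeper
        for target in targets:            # nested loop: colon again, deeper still
            edges.append((page, target))  # runs once per target
    else:                                 # else lines up with its if and takes a colon
        edges.append((page, None))        # runs only when the page has no links

print(edges)  # back at the left margin, so this runs after the outer loop ends
```

Every line that ends in a colon must be followed by at least one line indented further, and all lines in the same block must use exactly the same indentation (four spaces per level is the usual convention).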
Maybe you need to study Python’s fundamentals before doing something bigger.
Look around the official tutorial.
I’m a firm believer in ethical classroom assignments though. Even if it’s “just an assignment”, it should be something that is legal, ethical, and not violating a site’s terms of service. All it would require is selecting a different site to scrape.
Some sites were built for this purpose: https://www.scrapethissite.com/
Did not know about that, that’s perfect. I don’t know why some educators seem to think that the only way to teach these techniques is to use a famous site like Wikipedia.