Hello there,
my goal is to scrap a lot of different URLs as fast as possible.
I am currently using Selenium (Chrome driver) with a thread pool (I can run 8 threads at the same time). I also tried requests with proxies, but I don’t know if that’s faster.
I can’t use aiohttp or scrapy.
If someone knows how to scrap websites VERY fast, I am open to any suggestions.
Thank you
- *scrape [1]
- Web scraping is against the terms of service of some websites. Be a good citizen of the web and check for authorization.
- You should be aware of robots.txt.
- When automating web requests, the limiting factor will always end up being the actual time it takes for a request to complete. As you’ve noticed, you can perform several requests concurrently to increase throughput.
- If you aren’t interested in asyncio, then your gold standard on the client end is probably “requests” plus “beautifulsoup4” with threads for concurrent requests, which sounds close to what you’re already doing (there’s a minimal sketch after this list).
- On the good citizen front, you’ll want to limit the number of requests you make to the same domain per time period. That is, if you’re making concurrent requests, spread them across different sites so you aren’t hammering someone’s server. It’s rude.
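Here is a minimal sketch of that requests + beautifulsoup4 + thread pool approach. The URL list and the parse step (pulling out the page title) are placeholders; swap in your own URLs and parsing:

```python
# Minimal sketch: fetch several URLs concurrently with a thread pool,
# parse each response with BeautifulSoup. URLs and parsing are placeholders.
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from bs4 import BeautifulSoup

URLS = [
    "https://example.com/page1",  # placeholder URLs
    "https://example.org/page2",
]

def fetch_title(url):
    # One GET per call; you could reuse a requests.Session per thread for speed.
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return url, soup.title.string if soup.title else None

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch_title, url) for url in URLS]
    for future in as_completed(futures):
        try:
            url, title = future.result()
            print(url, "->", title)
        except requests.RequestException as exc:
            print("request failed:", exc)
```

Since the work is I/O-bound, the pool size mostly trades throughput against how hard you hit the remote servers, which is where the next point comes in.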
In general, the way to approach good-faith web scraping is to rate limit yourself, spread the load, and honor what a site asks (e.g. in robots.txt); the sketch below shows one way to do both. If your only goal is to go “as fast as possible” in terms of requests per second, I’d suggest looking for a different project, because you’re probably abusing someone else’s web infrastructure that they literally pay money for.
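For the robots.txt and rate-limiting part, here is a small sketch using the standard library’s urllib.robotparser plus a per-host delay. The 2-second delay, the user-agent string, and the helper names are arbitrary placeholders, not any kind of standard:

```python
# Minimal sketch: check robots.txt and enforce a minimum delay per host.
# MIN_DELAY and the helper names are illustrative assumptions.
import threading
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

MIN_DELAY = 2.0  # seconds between requests to the same host (placeholder)
_last_hit = {}
_lock = threading.Lock()

def allowed_by_robots(url, user_agent="my-scraper"):
    # Fetch and parse the site's robots.txt, then ask if this URL may be crawled.
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)

def wait_for_host(url):
    # Block until at least MIN_DELAY has passed since the last request to this
    # host; safe to call from multiple worker threads.
    host = urlparse(url).netloc
    with _lock:
        now = time.monotonic()
        sleep_for = max(0.0, MIN_DELAY - (now - _last_hit.get(host, 0.0)))
        _last_hit[host] = now + sleep_for
    if sleep_for:
        time.sleep(sleep_for)
```

Call allowed_by_robots() once per site before queueing its URLs, and wait_for_host() right before each request inside your worker function.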
[1] English is weird.