Hi everyone, I’ve been working on a small project for my website, which focuses on restaurant-related content like menus, offers, and reviews. I use a Python backend script to scrape, process, and update certain parts of the website’s data (mainly pulling prices, item descriptions, and user feedback from my own database). Everything was working fine until recently, when the script started timing out and throwing parsing errors during execution.
The issue began after I made some updates to the site’s structure — specifically, I switched to using more dynamic JavaScript-based rendering for my menu pages. Now, whenever my Python script tries to fetch those pages using requests or urllib, it receives incomplete HTML, which breaks my BeautifulSoup parser. I tried using requests-html and even selenium to load dynamic content, but that caused the script to slow down significantly and occasionally hang.
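For context, the static fetch that used to work is basically this (simplified; the URL and the CSS selector are placeholders, not my real ones):

```python
import requests
from bs4 import BeautifulSoup

MENU_URL = "https://example.com/menu"  # placeholder, not my actual domain

def fetch_menu_items():
    resp = requests.get(MENU_URL, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Since the switch to JS rendering this returns an empty list, because
    # the menu nodes are injected client-side and never appear in resp.text.
    return [item.get_text(strip=True) for item in soup.select("div.menu-item")]
```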
In addition to that, I’m also facing random connection resets when the script runs multiple requests concurrently. I’m using aiohttp for asynchronous fetching, and while it’s faster, sometimes I get ClientOSError: [Errno 54] Connection reset by peer. I checked my hosting provider’s limits and made sure I wasn’t hitting any request rate caps. It seems like something in the async handling or session reuse is causing instability.
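Roughly what the concurrent fetching looks like, simplified; the semaphore, connector limits, and retry are just illustrative numbers I put in while writing this post, not what I actually run:

```python
import asyncio
import aiohttp

CONCURRENCY = 10  # illustrative; I don't know the "right" value for my host

async def fetch_one(session, url, sem):
    async with sem:
        for attempt in range(3):  # crude retry on connection resets
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=20)) as resp:
                    resp.raise_for_status()
                    return await resp.text()
            except aiohttp.ClientOSError:
                await asyncio.sleep(2 ** attempt)
        return None  # give up after retries; handled by the caller

async def fetch_all(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    connector = aiohttp.TCPConnector(limit=CONCURRENCY, limit_per_host=5)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch_one(session, u, sem) for u in urls))

# results = asyncio.run(fetch_all(list_of_menu_urls))
```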
Another thing I noticed is that the data processing part of the script (where it writes updates to a CSV or uploads via API to my website’s backend) occasionally fails silently. When I log the responses, I see some entries being skipped, even though the input data is valid. I added try-except blocks and proper logging, but I still can’t pinpoint where the process is breaking. It’s frustrating because it doesn’t always fail at the same point—it’s random.
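The upload/export step is roughly the following (simplified; API_URL, the field names, and the log file name are placeholders). I'm still not sure whether the skipped entries happen here or earlier in the pipeline:

```python
import csv
import logging
import requests

logging.basicConfig(level=logging.INFO, filename="update_run.log")

API_URL = "https://example.com/api/menu-items"  # placeholder endpoint

def upload_items(session: requests.Session, items: list[dict]) -> None:
    for item in items:
        try:
            resp = session.post(API_URL, json=item, timeout=10)
            if resp.status_code != 200:
                # Logging the body too -- a quiet 4xx here would explain "skipped" rows.
                logging.warning("Skipped %s: %s %s", item.get("id"), resp.status_code, resp.text[:200])
        except requests.RequestException:
            logging.exception("Upload failed for %s", item.get("id"))

def write_csv(items: list[dict], path: str = "menu_export.csv") -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=sorted({k for item in items for k in item}))
        writer.writeheader()
        writer.writerows(items)
```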
I’ve also tried breaking the script into smaller chunks and running it in batches, but then I run into authentication issues. My website API requires a token that expires every hour, and reauthenticating mid-process sometimes causes mismatched session headers. I’m using the requests session object to persist authentication, but it seems to expire too early when dealing with async tasks.
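One thing I've been sketching (not sure it's the right pattern) is a thin wrapper that refreshes the token a few minutes before it expires instead of reacting to failures mid-batch. Something like this, where the auth endpoint and the response fields are placeholders for whatever my API actually returns:

```python
import time
import requests

AUTH_URL = "https://example.com/api/auth"  # placeholder; my real endpoint differs

class TokenSession:
    """Refresh the bearer token shortly before it expires, then delegate to requests."""

    def __init__(self, client_id: str, client_secret: str):
        self.session = requests.Session()
        self.client_id = client_id
        self.client_secret = client_secret
        self.expires_at = 0.0
        self._refresh()

    def _refresh(self) -> None:
        resp = self.session.post(
            AUTH_URL,
            data={"id": self.client_id, "secret": self.client_secret},
            timeout=10,
        )
        resp.raise_for_status()
        payload = resp.json()  # assuming something like {"token": ..., "expires_in": 3600}
        self.session.headers["Authorization"] = f"Bearer {payload['token']}"
        # Refresh five minutes early so a long batch never runs with a stale token.
        self.expires_at = time.time() + payload["expires_in"] - 300

    def request(self, method: str, url: str, **kwargs) -> requests.Response:
        if time.time() >= self.expires_at:
            self._refresh()
        return self.session.request(method, url, **kwargs)
```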
Has anyone dealt with similar issues when trying to scrape or process dynamic website content using Python? I’d appreciate advice on optimizing asynchronous requests, handling session tokens for long-running scripts, and best practices for parsing JS-rendered HTML without switching to full Selenium automation. I’d also love to know if there’s a more modern approach (maybe Playwright or httpx) that handles these issues better than what I’m using now. Sorry for the long post!