Python Script Failing to Process Data from My Website (Timeouts and Parsing Errors)

Hi everyone, I’ve been working on a small project for my website, which focuses on restaurant-related content like menus, offers, and reviews. I use a Python backend script to scrape, process, and update certain parts of the website’s data (mainly pulling prices, item descriptions, and user feedback from my own database). Everything was working fine until recently, when the script started timing out and throwing parsing errors during execution.

The issue began after I made some updates to the site’s structure — specifically, I switched to using more dynamic JavaScript-based rendering for my menu pages. Now, whenever my Python script tries to fetch those pages using requests or urllib, it receives incomplete HTML, which breaks my BeautifulSoup parser. I tried using requests-html and even selenium to load dynamic content, but that caused the script to slow down significantly and occasionally hang.

In addition to that, I’m also facing random connection resets when the script runs multiple requests concurrently. I’m using aiohttp for asynchronous fetching, and while it’s faster, sometimes I get ClientOSError: [Errno 54] Connection reset by peer. I checked my hosting provider’s limits and made sure I wasn’t hitting any request rate caps. It seems like something in the async handling or session reuse is causing instability.
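To make the aiohttp part concrete, here is a stripped-down sketch of the fetch pattern I mean. The URLs, timeout, and concurrency numbers are placeholders rather than my real values, and the semaphore plus connector limit are just throttling ideas I've been testing, not a confirmed fix:

import asyncio
import aiohttp

URLS = ["https://example.com/menu/1", "https://example.com/menu/2"]  # placeholder URLs

async def fetch(session, sem, url):
    # Cap how many requests are in flight at once; flooding the host with
    # concurrent connections is a common cause of "Connection reset by peer".
    async with sem:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
                resp.raise_for_status()
                return await resp.text()
        except (aiohttp.ClientOSError, asyncio.TimeoutError) as exc:
            print(f"fetch failed for {url}: {exc!r}")
            return None

async def main():
    sem = asyncio.Semaphore(5)                   # placeholder concurrency cap
    connector = aiohttp.TCPConnector(limit=10)   # also cap the pooled connections
    async with aiohttp.ClientSession(connector=connector) as session:
        pages = await asyncio.gather(*(fetch(session, sem, u) for u in URLS))
    return pages

if __name__ == "__main__":
    asyncio.run(main())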

Another thing I noticed is that the data processing part of the script (where it writes updates to a CSV or uploads via API to my website’s backend) occasionally fails silently. When I log the responses, I see some entries being skipped, even though the input data is valid. I added try-except blocks and proper logging, but I still can’t pinpoint where the process is breaking. It’s frustrating because it doesn’t always fail at the same point—it’s random.
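For the processing side, the CSV path is shaped roughly like the sketch below (the column names and file path are made up for the example). The idea I'm aiming for is to log every skipped row with a full traceback instead of letting it disappear:

import csv
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

def write_rows(rows, path="menu_export.csv"):
    """Write menu rows to CSV, logging (not hiding) every row that fails."""
    skipped = []
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["item", "price", "description"])
        writer.writeheader()
        for i, row in enumerate(rows):
            try:
                writer.writerow(row)
            except (ValueError, TypeError) as exc:
                # log.exception records the full traceback, so a bad row
                # shows up in the log instead of silently vanishing.
                log.exception("row %d skipped: %r", i, row)
                skipped.append((i, row))
    if skipped:
        log.warning("%d of %d rows were skipped", len(skipped), len(rows))
    return skipped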

I’ve also tried breaking the script into smaller chunks and running it in batches, but then I run into authentication issues. My website API requires a token that expires every hour, and reauthenticating mid-process sometimes causes mismatched session headers. I’m using the requests session object to persist authentication, but it seems to expire too early when dealing with async tasks.
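To make the auth issue concrete, this is the general pattern I've been trying to get right: refresh the token a little before it expires instead of reacting to failures mid-batch. The endpoint and field names below are invented for the example, not my real API:

import time
import requests

API_BASE = "https://example.com/api"  # placeholder, not my real endpoint

class TokenSession:
    """requests.Session wrapper that refreshes the bearer token before it expires."""

    def __init__(self, username, password, lifetime=3600):
        self.session = requests.Session()
        self.username = username
        self.password = password
        self.lifetime = lifetime
        self.expires_at = 0.0

    def _refresh(self):
        # Hypothetical auth endpoint; replace with whatever the backend exposes.
        resp = self.session.post(
            f"{API_BASE}/auth/token",
            json={"username": self.username, "password": self.password},
            timeout=10,
        )
        resp.raise_for_status()
        token = resp.json()["token"]
        self.session.headers["Authorization"] = f"Bearer {token}"
        # Renew a couple of minutes early so a long batch never runs on a stale token.
        self.expires_at = time.time() + self.lifetime - 120

    def request(self, method, path, **kwargs):
        if time.time() >= self.expires_at:
            self._refresh()
        return self.session.request(method, f"{API_BASE}{path}", **kwargs)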

Has anyone dealt with similar issues when trying to scrape or process dynamic website content using Python? I’d appreciate advice on optimizing asynchronous requests, handling session tokens for long-running scripts, and best practices for parsing JS-rendered HTML without switching to full Selenium automation. I’d also love to know if there’s a more modern approach (maybe Playwright or httpx) that handles these issues better than what I’m using now. Sorry for the long post!

If I understand you correctly, it’s HTML that you’ve written yourself.
I would assume you can just fix the HTML so that it’s parsable?

Honestly, you could try a bunch of different little tricks or hacks, but the real solution is to use Playwright, which drives an actual browser and renders the page the same way your visitors’ browsers do. Once JS is doing a bunch of work on the front end, grabbing the raw HTML is always going to be hit or miss.
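Something along these lines is usually all you need to grab the fully rendered HTML (untested sketch; swap in your own URL and a selector that your JS actually renders):

import asyncio
from playwright.async_api import async_playwright

async def fetch_rendered(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        # Wait for something the JS actually renders, e.g. a menu container.
        await page.wait_for_selector(".menu-item")  # placeholder selector
        html = await page.content()
        await browser.close()
    return html

if __name__ == "__main__":
    html = asyncio.run(fetch_rendered("https://example.com/menu"))  # placeholder URL
    print(len(html))

From there you can feed the returned HTML into BeautifulSoup exactly as before; only the fetching step changes.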

Thanks for the suggestion; that actually makes a lot of sense. I was trying to avoid Playwright at first because I wanted to keep the script lightweight, but given how much of my site’s menu data now depends on JavaScript rendering, it’s probably the most reliable approach. I’ll start experimenting with Playwright’s async API to see if I can balance accuracy with performance.

You’re right that scraping static HTML just doesn’t cut it anymore once the frontend gets too dynamic. Hopefully, using a headless browser will help me capture the fully rendered content without all the parsing errors I’ve been running into. Appreciate the clear advice — this gives me a solid direction to move forward with!