Long-term stable source to get news headlines with Python?

I am planning to do multiple data visualisations of the sentiment of real-time news headlines from different newspapers. The visualisations will be displayed at different sites (on physical LED screens) for a long period. I am trying to find the best way to get the news with Python, according to my needs:

  • The source has to be stable in its format, because the piece is going to run for a long period. That’s why I am hesitant about web scraping.
  • I want to be able to have multiple instances (let’s say 10) of the visualisation, each displaying different content. That’s why I am not sure about using the Twitter API, due to the request limits and having to depend on a Twitter account.

The only option that comes to my mind is using RSS feeds, as they are solid and easy to integrate with Python, but at the same time the use of RSS feeds is in decline and fewer digital newspapers offer them. I would like to have access to as many different newspapers as possible.

Do you know of any other options, or have any tips on how to handle this?

Thanks in advance, Joan.

I don’t think that anyone can make promises on behalf of other
organisations regarding the stability of their websites.

But some sites may be more stable than others. Any site that offers an
explicit API for fetching data (Wikipedia, Reddit, Twitter) is probably
not planning on massively changing things too soon. Likewise if they
have an RSS feed.
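
If a paper does have an RSS feed, the Python side is trivial. As a
minimal sketch, assuming the third-party feedparser package and a
placeholder feed URL:

    # Minimal sketch: read the latest headlines from one RSS feed.
    # feedparser is a third-party package (pip install feedparser);
    # the URL below is only a placeholder.
    import feedparser

    feed = feedparser.parse("https://example.com/rss")
    for entry in feed.entries[:5]:
        print(entry.title)  # most feeds expose a title per entry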

You might look at using the Wikipedia API to fetch their “In The News”
section from the front page.
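
As a hedged sketch, using the MediaWiki “parse” API: the “In the news”
items currently live on the Template:In_the_news page, but do verify
that yourself before relying on it long-term.

    # Fetch the rendered HTML of Wikipedia's "In the news" template via
    # the MediaWiki action API. The template name is an assumption that
    # may change; check it before depending on it.
    import requests

    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "parse",
            "page": "Template:In_the_news",
            "prop": "text",
            "format": "json",
            "formatversion": 2,
        },
        headers={"User-Agent": "news-headline-demo"},  # identify yourself
        timeout=10,
    )
    html = resp.json()["parse"]["text"]  # rendered HTML of the template
    print(html[:500])  # you would then pull the <li> items out of this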

Apart from being (potentially, if not actually) rude, using web scraping
to grab data from a website leaves you open to being blocked for abuse,
sued for copyright infringement, or arrested for “computer trespass”.
(And don’t think I am exaggerating the risk.) So web scraping should be
considered a last resort, and definitely not something that you do
lightly if there are alternatives. But see Beautiful Soup for a library
to help with that.
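
If you do go down that road, a rough sketch with requests and Beautiful
Soup might look like this; the selector is made up, since every site’s
markup differs, which is exactly why scraping is fragile.

    # Hypothetical scrape: assumes headlines sit in <h2 class="headline">
    # tags on a placeholder front page. Real sites will differ.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://newspaper.example/", timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    headlines = [h.get_text(strip=True) for h in soup.select("h2.headline")]
    print(headlines)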

And remember to be a good net citizen by limiting your own rate. The
smaller your impact on the site, the less likely that they will
distinguish you from regular web browsing and the less they will care
that you are web scraping.
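
A crude way to do that is to sleep between requests and only poll each
source every so often; the sources and intervals below are invented, so
check each site’s robots.txt and terms for what they actually allow.

    # Simple self-imposed rate limit: a small pause between requests and
    # a long pause between full passes. All URLs and numbers here are
    # placeholders.
    import time
    import requests

    SOURCES = ["https://example.com/rss", "https://example.org/rss"]
    PASS_INTERVAL = 15 * 60  # seconds between full passes over the sources

    while True:
        for url in SOURCES:
            try:
                resp = requests.get(url, timeout=10)
                # ... feed resp into your parsing/sentiment pipeline ...
            except requests.RequestException:
                pass  # one dead source should not kill the whole loop
            time.sleep(5)  # small pause between individual requests
        time.sleep(PASS_INTERVAL)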

If you are worried about using the Twitter API and being rate-limited,
well, that’s exactly what the API is for: to make sure that people don’t
abuse their service. The alternative to using their API is web scraping;
see above.

If you want to have multiple visualisations, you can still do that. I
suggest you cast your net a bit wider: instead of grabbing a large
amount of news from ten sites, fetch a smaller amount from twenty
sites, mix it all together, and split it up between your ten
visualisations. If some sites limit the number of queries that you can
make, make up for it by adding extra sites.
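
As a sketch of that idea (the feed URLs are placeholders and feedparser
is again an assumption): pull a few headlines from each feed, shuffle
the pool, and deal it out round-robin across the displays.

    # Mix headlines from many feeds and split them across ten displays.
    import random
    import feedparser

    FEEDS = ["https://paper%d.example/rss" % i for i in range(20)]  # placeholders
    NUM_DISPLAYS = 10

    headlines = []
    for url in FEEDS:
        feed = feedparser.parse(url)
        headlines.extend(entry.title for entry in feed.entries[:5])

    random.shuffle(headlines)
    # Deal the mixed pool out round-robin, one slice per display.
    per_display = [headlines[i::NUM_DISPLAYS] for i in range(NUM_DISPLAYS)]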

You’ll also need redundancy for when sites are down, or quiet.

But ultimately, you need to do your own homework to see what sites will
either offer an API (under terms you can accept) or can be successfully
scraped. “What sites can I fetch news from” is not a Python question.