Web scraping from scientific database

shktmhs2020 · February 24, 2020, 8:59am

Hello. I am learning Python, and clearly I don’t know what I am doing. But what I want to do is to use web scraping to extract the information I need from a scientific database like UniProt.

For example, I would like to extract certain information from this link (https://www.uniprot.org/uniprot/Q15116). I would like to save the table information of “Topology”, “Molecule processing” and “Region”.

I read the following article on the Real Python website (https://realpython.com/beautiful-soup-web-scraper-python/). It looked easy enough, but this is over my head.

Please help. I need to do a similar operation many times. Thank you.

dboddie · February 25, 2020, 10:24am

Please consider using the online API to retrieve the information you need instead of scraping web pages to get it. Maintainers of services like these provide the API to make it easy to retrieve the information in a machine-readable form. The HTML pages are meant for humans to read, not programs.
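As a rough illustration of the API route, here is a minimal sketch using only the Python standard library. Several things in it are assumptions rather than confirmed details: the endpoint pattern in `BASE_URL`, the layout of the JSON (a top-level `features` list whose items carry `type`, `location` and `description`), and the helper names `fetch_entry` and `extract_features`, which are made up for this example. Check the service's API documentation before relying on any of it.

```python
import json
import urllib.request

# Assumed endpoint pattern; verify it against the UniProt API documentation.
BASE_URL = "https://rest.uniprot.org/uniprotkb/{accession}.json"


def fetch_entry(accession, timeout=30):
    """Download one entry and return it as a dict (performs a network request)."""
    url = BASE_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def extract_features(entry, wanted_types):
    """Return (type, start, end, description) tuples for the wanted feature types.

    Assumes each feature looks like:
    {"type": ..., "location": {"start": {"value": ...}, "end": {"value": ...}},
     "description": ...}
    """
    rows = []
    for feature in entry.get("features", []):
        if feature.get("type") not in wanted_types:
            continue
        location = feature.get("location", {})
        start = location.get("start", {}).get("value")
        end = location.get("end", {}).get("value")
        rows.append((feature["type"], start, end, feature.get("description", "")))
    return rows
```

It could then be used along the lines of `extract_features(fetch_entry("Q15116"), {"Topological domain", "Transmembrane", "Signal", "Chain", "Region"})`. Those type names are a guess at how the "Topology", "Molecule processing" and "Region" sections of the page map onto the data. Printing the distinct `type` values of a downloaded entry is a quick way to find the real ones.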

If you scrape the information you need from a page, you will have to translate it into a usable format, and this may break if the site changes the way the information is displayed in the future.

Also, maintainers of sites like these may well consider programmatic scraping of their web pages to be antisocial, especially if you are going to do it many times. This may lead to your IP address being blocked, which won’t help you get the information you need.
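If you do end up making many requests, whether to an API or to pages, spacing them out is one simple way to be considerate. The sketch below is only an illustration: `fetch_all` is a made-up helper, `fetch` stands for whatever function you use to retrieve a single entry, and the one-second default delay is an arbitrary placeholder, not a limit published by any service.

```python
import time


def fetch_all(accessions, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(accession) for each accession, pausing between requests.

    `fetch` is any callable that takes one accession and returns its data.
    `sleep` can be swapped out so the pause can be skipped when testing.
    """
    results = {}
    for index, accession in enumerate(accessions):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results[accession] = fetch(accession)
    return results
```

Passing the fetching function in as an argument keeps the pacing logic separate from the details of any particular site.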

If you require a very large amount of the data from the site then you should talk to the maintainers of the site. It may be possible to obtain underlying datasets more efficiently than requesting large amounts of data over the network in lots of small requests.

shktmhs2020 · February 28, 2020, 6:45am

Thank you, dboddie. I don’t know what I am doing, but… wow. When I am better at this API, the quality of my life will be drastically improved. Awesome!