Screen scraping

mitchmcc · January 14, 2022, 11:40pm

I have used Python and Beautiful Soup for several screen scraping projects in the past. But I have a more general question about how to get at a particular web site that I cannot figure out. I know this is not directly a Python question, but I am not sure where to ask it.
My past experience is like this. When I go to a database-backed web page, there is often a form that you fill in with the information needed to do a lookup. When the results come back, sometimes the URL shows exactly how to request that page using the form parameters as data, e.g.
www.testsite.com/my-webpage?param1=username
So once you know this, you can bypass the form and build the output URL directly, then scrape the data.
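As a rough illustration of that approach, here is a minimal sketch that builds such a URL from its parameters. The host, path, and `param1` come from the made-up example above, and `build_lookup_url` is just a hypothetical helper name; nothing is fetched here.

```python
from urllib.parse import urlencode


def build_lookup_url(base_url, **params):
    """Return base_url with params appended as an encoded query string."""
    return f"{base_url}?{urlencode(params)}"


# Placeholder site and parameter name, taken from the example above.
url = build_lookup_url("https://www.testsite.com/my-webpage", param1="username")
print(url)  # https://www.testsite.com/my-webpage?param1=username
```

The resulting URL can then be fetched and the response handed to Beautiful Soup as usual.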
The page I want appears to be served by a Microsoft stack (it is an .aspx page), and returns something like
https://wwwtestsite.org/PublicAccess/Search.aspx?ID=300&RefineSearch=1

IOW nothing in the URL indicates what I searched for, or gives me a way to ask directly about the parameter “username” shown above.

Am I completely barking up the wrong tree here? Or would I need to do a more sophisticated analysis of the request and response via the browser developer console or Wireshark?

So I am looking for very high level guidance about (I guess) how to figure out the code behind the page???

Thanks,
Mitch

CAM-Gerlach · January 15, 2022, 4:26am

I’m not a web expert, but there are a lot of things it could be. If it’s just using an HTTP POST request rather than a GET (where you can see the query params in the URL), Wireshark or just your browser’s normal dev tools (which should be sufficient for most non-obfuscated methods) will show you what the POST payload is, which you can then reproduce. Beyond that, it could be a script making AJAX requests, or sending some other sort of payload. The search state could be stored in and retrieved from a cookie, or even just kept server-side. So it really depends on the specific page in question; replicating this sort of thing is pretty non-trivial, which is why most services that don’t mind you doing this will provide an API of some sort.
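For the plain POST case, reproducing the payload might look roughly like the sketch below. Everything specific in it is an assumption for illustration: the sample form, the `username` field name, the `example.invalid` URL, and the guess that the page carries hidden inputs such as `__VIEWSTATE` that have to be echoed back (common on ASP.NET Web Forms pages, but unconfirmed for this site). The request is only built, not sent.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


# Hypothetical form markup; the real page's fields must be read from dev tools.
SAMPLE_FORM = """
<form method="post" action="Search.aspx?ID=300">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="text" name="username" />
</form>
"""


def build_post_request(url, form_html, **search_fields):
    """Build (but do not send) a POST that echoes hidden fields plus the search terms."""
    collector = HiddenInputCollector()
    collector.feed(form_html)
    payload = {**collector.fields, **search_fields}
    body = urlencode(payload).encode("ascii")
    return Request(url, data=body, method="POST")


req = build_post_request(
    "https://example.invalid/PublicAccess/Search.aspx?ID=300",  # placeholder URL
    SAMPLE_FORM,
    username="mitch",
)
print(req.get_method(), req.data)
```

In practice you would first fetch the search page to get the current hidden values, then pass the built request to `urllib.request.urlopen` (or use a session in a library like `requests` so cookies carry over), and compare what you send against what the dev tools show.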

mitchmcc · January 15, 2022, 12:59pm

Thanks for replying. I sort of understand how to use the dev console to figure out the exact request… I don’t mean to rant, but it just seems like this is another example of Microsoft over-engineering a basic Web request and making it difficult.
I am only 65 years old and will try to grow up now.