Using requests package: Requested zip files are not opening and scripts are getting downloaded as HTML

hhadah · March 21, 2022, 7:29pm

Hi guys!

I am using the requests package to download zip files and scripts from a URL with authentication. I’m using the following code.

    import requests
    user, password = 'ID', 'Password'
    for year in range(1959, 2018):
        r = requests.get(f"https://data-nber.org.ezproxy/mortality/{year}/mort{year}.do", auth=(user, password))

When I run the code, I get zip files and scripts. The zip files do not open, and the scripts contain HTML rather than the original code. Is there a way I can tweak the script to solve this problem?

Thanks!

ferdnyc · March 23, 2022, 3:47am

Hmm. I’m assuming this is proxying the data from https://data.nber.org/mortality/? (That doesn’t require authentication, but maybe your proxy does.) Regardless, you need to show your entire code if you want us to find the problem; it isn’t likely to be in the couple of lines you posted.

But, for example, here’s how I successfully retrieved a working, valid .zip file from that service using Python:

>>> import requests
>>> r = requests.request(url='https://data.nber.org/mortality/1959/mort1959.zip', method='GET')
>>> r.status_code
200
>>> r.headers
{'Date': 'Wed, 23 Mar 2022 03:37:08 GMT', 'Server': 'Apache/2.4.52 (FreeBSD) PHP/7.4.27 OpenSSL/1.1.1k-freebsd mod_apreq2-20090110/2.8.0 mod_perl/2.0.11 Perl/v5.32.1', 'X-XSS-Protection': '1', 'Last-Modified': 'Tue, 06 Mar 2001 18:01:26 GMT', 'ETag': '"e039a4-37eda767ce980"', 'Accept-Ranges': 'bytes', 'Content-Length': '14694820', 'Vary': 'Origin', 'Access-Control-Allow-Credentials': 'true', 'Keep-Alive': 'timeout=20, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'application/zip'}
>>> len(r.content)
14694820
>>> with open("/tmp/file.zip", "wb") as f:
...     f.write(r.content)
... 
14694820
>>>

Writing the data to disk in binary mode is important when dealing with binary content like .zip files. (The b in the second argument to open(..., "wb") does that.)
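As a rough sketch of that point, here is one way to write the bytes in binary mode and then sanity-check that what landed on disk is really a zip archive. The helper names save_binary and looks_like_zip are made up for illustration; zipfile.is_zipfile is from the standard library.

```python
import zipfile


def save_binary(content: bytes, path: str) -> int:
    """Write raw bytes to disk in binary mode and return the byte count."""
    # "wb" keeps Python from applying any text encoding or newline translation.
    with open(path, "wb") as f:
        return f.write(content)


def looks_like_zip(path: str) -> bool:
    """Return True if the file at `path` has a valid zip structure."""
    # An HTML error or login page saved with a .zip name fails this check.
    return zipfile.is_zipfile(path)
```

With a response r from requests, that would be something like save_binary(r.content, "/tmp/file.zip") followed by looks_like_zip("/tmp/file.zip"). If the second call returns False, the download was probably not a zip file in the first place.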

If you’re getting HTML documents, then you’re likely just not retrieving the correct files. For example, the actual URL you posted (if I translate it into a public URL) is just a log file of some process, it’s not any kind of useful code at all. You can view it in your browser to confirm that:

https://data.nber.org/mortality/1959/mort1959.do

I don’t know what that is, but it’s nothing you want or can use.

The entire contents of the server are browsable at https://data.nber.org/mortality/ , so you can look in the various directories and find the correct URLs to the files you need. But I suspect the issue isn’t with your code (well, maybe, but not only with your code). It may also be with what you’re requesting.

If your code is having issues, though, please post it. All of it, not four lines, unless you’re sure the only problem is in those four lines.

Edit: It’s also possible your proxy is messing up the content type of the data returned. You can look at the value of r.headers.get('Content-Type') after performing the request to check. For the .zip file I retrieved directly from the source server, it came back correct, which tells me that the data wasn’t corrupted in transit by being re-encoded as something else:

>>> r.headers.get('Content-Type')
'application/zip'

For other types of content, like the .do file (if you do actually need that), if I request it from the source server the same way as the zip file, it comes through correctly as plain text. But it’s possible the proxy is re-encoding it as HTML on you.

If so, it may be possible to send a header that will tell it to preserve the plain-text encoding of the original data, but I’m not familiar enough with proxy requests to know the precise mechanics that would be involved.

It could also just be that the HTML response is an indication that your authentication request to the proxy is failing.
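To tell these cases apart before saving anything, a small check along these lines might help. It is only a sketch: is_probably_html is a hypothetical helper, and the 512-byte sniff is an arbitrary choice, not anything the proxy or requests defines.

```python
from typing import Optional

# Media types that indicate an HTML page rather than the requested file.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_probably_html(content_type: Optional[str], body: bytes) -> bool:
    """Guess whether a response is an HTML page (e.g. a proxy login form)."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type in HTML_TYPES:
        return True
    # Fall back to sniffing the start of the body in case the header is wrong.
    head = body[:512].lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))
```

With a response r from requests, you would call is_probably_html(r.headers.get('Content-Type'), r.content) and look at r.status_code as well. If it returns True for a URL that should be a .zip, the response is most likely a login or error page from the proxy, and printing the first few hundred characters of r.text should show which.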