Is there some magic needed to read files via a URL whose hostname goes through the Cloudflare proxy?
I started to get 403 Forbidden on tests which have worked for a long time. I’m using this simple code:
from urllib.request import urlopen
data = urlopen(uri).read()
This works if uri has a DNS-only host, but fails if the Cloudflare proxy is involved.
It seems that simple code needs to become something like this, sending a User-Agent header:
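from urllib.request import Request, urlopen
headers = {"User-Agent": "YourCustomUserAgent"}
request = Request(url, headers=headers)
# urlopen raises HTTPError for 4xx/5xx, so this check only guards other non-200 statuses
response = urlopen(request)
if response.status == 200:
    data = response.read()
else:
    raise ValueError(f'cannot read {url}')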
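For anyone hitting the same thing, here is a minimal sketch of that workaround wrapped in a reusable function. It assumes the 403 is triggered by urllib's default User-Agent (Python-urllib/3.x), which is a guess rather than something confirmed in this thread. The names build_request, fetch and DEFAULT_USER_AGENT are made up for illustration.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Hypothetical identifier; substitute something that describes your client.
DEFAULT_USER_AGENT = "my-test-client/1.0"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request that overrides urllib's default User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=DEFAULT_USER_AGENT, timeout=10):
    """Read the body at url; raise ValueError carrying the HTTP status on failure."""
    try:
        with urlopen(build_request(url, user_agent), timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses such as the 403 seen here.
        raise ValueError(f"cannot read {url}: HTTP {err.code} {err.reason}") from err
```

Calling data = fetch(uri) would then replace the original two-liner. Catching HTTPError matters because a 403 never reaches a status check on the response object.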
If you download it with curl, do you need the -L flag (or whatever it is) to follow redirects?
Is it the full “Shields Up! Repel boarders!” hide-my-server proxy, or only their DNS proxy (the minimum for analytics)? Presumably bot protection etc. is turned off?
A lot of these problems are handled by features the requests library includes out of the box, so you don't have to write your own workarounds as you do when using urllib directly.
If that still doesn’t work but the link works in a browser, and the end goal is to get the tests working again, then to get rid of the false negative and for long-term stability the tests should perhaps be refactored to use BrowserStack or a headless browser. At the very least, use something that Cloudflare sees as more like an actual user’s browser and less like a web scraper.
I downloaded without any -L flag. I know my boss is enthusiastic about Cloudflare; I'm not so much.
The failing host is proxied; the working one is DNS only.
The working code in my second post just seems to add a User-Agent header. I think any reasonable bot would work that out. The surprise is that the old code worked for several months after the changeover to Cloudflare without any issue. I assume Cloudflare changes constantly to resist new threats, so I suppose someone added a rule somewhere that changed the outcome.
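One way to check whether the 403 really comes from Cloudflare rather than the origin server is to look at the headers on the error response. This is only a sketch: it assumes Cloudflare-generated responses carry a Server: cloudflare header and/or a CF-RAY header, and describe_http_error is a made-up helper name. The error at the bottom is synthetic, built only to show the call.

```python
from email.message import Message
from urllib.error import HTTPError


def describe_http_error(err):
    """Summarise an HTTPError and guess whether Cloudflare produced it."""
    headers = err.headers
    server = headers.get("Server", "") if headers else ""
    cf_ray = headers.get("CF-RAY") if headers else None
    return {
        "status": err.code,
        "server": server,
        "cf_ray": cf_ray,
        # Assumption: either header suggests the response came from Cloudflare's edge.
        "from_cloudflare": "cloudflare" in server.lower() or cf_ray is not None,
    }


# Synthetic error for illustration; a real one would come from `except HTTPError as err`.
example_headers = Message()
example_headers["Server"] = "cloudflare"
example_error = HTTPError("https://example.com/", 403, "Forbidden", example_headers, None)
print(describe_http_error(example_error))
```

If from_cloudflare comes back true for the failing URL, the block is happening at the proxy, which would fit the theory that a rule changed on the Cloudflare side.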