Corporate Proxy and Web Scraping

When web scraping, I get blocked by my corporate proxy. This is the code I have so far:

import requests

s = requests.Session()
s.proxies = {
    "http": "http://user:username:password@123_proxy.com:80",
    "https": "https://user:username:password@123_proxy.com:80",
}
r = s.get("https://en.wikipedia.org/wiki/Tesla,_Inc.")

#start_url = 'https://en.wikipedia.org/wiki/Tesla,_Inc.'
start_url = r

Error

ProxyError: HTTPSConnectionPool(host='en.wikipedia.org', port=443): Max retries exceeded with url: /wiki/Tesla,_Inc. (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 407 authenticationrequired')))

At a glance, it seems like the issue might be with the format you're using to pass the authentication details. Specifically, try replacing user with your actual username and password with your actual password, and remove the extra username field, so that there are two fields left of the @ instead of three. E.g.

s.proxies = {
    "http": "http://my_username:my_password@123_proxy.com:80",
    "https": "https://my_username:my_password@123_proxy.com:80",
}

Otherwise, it might have to do with how the proxy server is configured, but I would recommend trying the above first.
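
One more thing worth checking: if your username or password contains reserved URL characters such as @, :, or /, they need to be percent-encoded before being embedded in the proxy URL, or the credentials will be parsed incorrectly. A minimal sketch, assuming placeholder credentials and the same proxy host as above:

import requests
from urllib.parse import quote

# Percent-encode the credentials so reserved characters (@, :, /)
# don't break URL parsing. Placeholder values -- substitute your own.
username = quote("my_username", safe="")
password = quote("p@ss:word", safe="")

s = requests.Session()
s.proxies = {
    "http": f"http://{username}:{password}@123_proxy.com:80",
    "https": f"https://{username}:{password}@123_proxy.com:80",
}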

Edit: If your company's proxy requires digest authentication instead of just basic user/password authentication, I don't believe this is directly supported by requests. However, there is a requests-toolbelt package available on PyPI (maintained by core contributors to requests, for utilities they think fit better outside the core library) that supports it. See the docs for HTTPProxyDigestAuth for usage information.

Hi Kyle,

When changing to that format, I get this error message:


OSError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    661 if is_new_proxy_conn:
--> 662 self._prepare_proxy(conn)
    663

~\Anaconda3\lib\site-packages\urllib3\connectionpool.py in _prepare_proxy(self, conn)
    947 conn.set_tunnel(self._proxy_host, self.port, self.proxy_headers)
--> 948 conn.connect()
    949

~\Anaconda3\lib\site-packages\urllib3\connection.py in connect(self)
    307 # self._tunnel_host below.
--> 308 self._tunnel()
    309 # Mark this connection as not reusable

~\Anaconda3\lib\http\client.py in _tunnel(self)
    920 raise OSError("Tunnel connection failed: %d %s" % (code,
--> 921 message.strip()))
    922 while True:

OSError: Tunnel connection failed: 407 authenticationrequired

During handling of the above exception, another exception occurred:

MaxRetryError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    448 retries=self.max_retries,
--> 449 timeout=timeout
    450 )

~\Anaconda3\lib\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    719 retries = retries.increment(
--> 720 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    721 )

~\Anaconda3\lib\site-packages\urllib3\util\retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    435 if new_retry.is_exhausted():
--> 436 raise MaxRetryError(_pool, url, error or ResponseError(cause))
    437

MaxRetryError: HTTPSConnectionPool(host='en.wikipedia.org', port=443): Max retries exceeded with url: /wiki/Tesla,_Inc. (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 407 authenticationrequired')))

During handling of the above exception, another exception occurred:

ProxyError Traceback (most recent call last)
<ipython-input> in <module>
     10 }
     11
---> 12 r = s.get("https://en.wikipedia.org/wiki/Tesla,_Inc.")
     13
     14 #start_url = 'https://en.wikipedia.org/wiki/Tesla,_Inc.'

~\Anaconda3\lib\site-packages\requests\sessions.py in get(self, url, **kwargs)
    544
    545 kwargs.setdefault('allow_redirects', True)
--> 546 return self.request('GET', url, **kwargs)
    547
    548 def options(self, url, **kwargs):

~\Anaconda3\lib\site-packages\requests\sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    531 }
    532 send_kwargs.update(settings)
--> 533 resp = self.send(prep, **send_kwargs)
    534
    535 return resp

~\Anaconda3\lib\site-packages\requests\sessions.py in send(self, request, **kwargs)
    644
    645 # Send the request
--> 646 r = adapter.send(request, **kwargs)
    647
    648 # Total elapsed time of the request (approximately)

~\Anaconda3\lib\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    508
    509 if isinstance(e.reason, _ProxyError):
--> 510 raise ProxyError(e, request=request)
    511
    512 if isinstance(e.reason, _SSLError):

ProxyError: HTTPSConnectionPool(host='en.wikipedia.org', port=443): Max retries exceeded with url: /wiki/Tesla,_Inc. (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 407 authenticationrequired')))

After doing some further investigation into this issue, it seems that requests does not directly support digest authentication over proxies, which could be the cause of the 407 error if your company's corporate proxy expects digest authentication rather than basic authentication. However, there is support for this in requests-toolbelt, which is maintained by the requests developers for additional utilities they've deemed useful but too niche for the core library. See the docs for HTTPProxyDigestAuth for details; it looks very simple to use.
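
For reference, the usage from the requests-toolbelt docs looks roughly like this (the proxy host and credentials below are placeholders; note that with digest auth the credentials are passed via auth rather than embedded in the proxy URL):

import requests
from requests_toolbelt.auth.http_proxy_digest import HTTPProxyDigestAuth

# Placeholder proxy address -- substitute your company's proxy here.
proxies = {
    "http": "http://123_proxy.com:80",
    "https": "https://123_proxy.com:80",
}

# Credentials go in the auth object, not the proxy URL.
auth = HTTPProxyDigestAuth("my_username", "my_password")
r = requests.get("https://en.wikipedia.org/wiki/Tesla,_Inc.", proxies=proxies, auth=auth)
print(r.status_code)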

If the above solution doesn't work, I would recommend getting in contact with your system administrators to determine the form of authentication required and to verify your credentials. Without knowing any of the configuration details of the proxy server, I'm just guessing at possible solutions.