Why can't I get any result from this?

```
from lxml import etree
import requests
from fake_useragent import UserAgent

url = 'https://www.aladdin-e.com/zh_cn/chemicals-and-biochemicals/bioscience/biological-buffers.html'
headers = UserAgent().random
resp = requests.get(url, headers)

tree = etree.HTML(resp.text)
# Collect the text of every link under the product grid.
ul_list = tree.xpath('//div[@class="products wrapper grid products-grid product-cate-grid"]//ol')
product_list = []
for ul in ul_list:
    product_list += ul.xpath('.//a/text()')
print(product_list)
```

Have you checked resp.status_code? It might be reporting an error.

Have you checked that resp.text has what you expect?
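For example, a minimal check along these lines (reusing the `resp` object from your code) would show whether the request succeeded:

```
print(resp.status_code)   # 200 means OK; 403 means the server refused access
resp.raise_for_status()   # raises requests.HTTPError on any 4xx/5xx status
print(resp.text[:500])    # peek at the body for any error message
```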


It says I don’t have permission to access the URL. Any suggestions?

By Coverdaisy via Discussions on Python.org at 13Sep2022 23:19:


It says I don’t have permission to access the URL. Any suggestions?

When you’ve got text output, please copy/paste it as text inline in your
message, between triple backticks, eg:

 ```
 output
 goes here
 ```

The same with programme code.

The screenshot above shows a 403 error response, which says that
the web server has refused access to the URL you’ve used. Usually that
means you need some kind of authentication with the request.
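If the service happens to use HTTP Basic authentication, requests can send credentials directly; a sketch with placeholder credentials (whether this applies at all depends on the service):

```
import requests

url = 'https://www.aladdin-e.com/zh_cn/chemicals-and-biochemicals/bioscience/biological-buffers.html'

# Placeholder credentials -- this only helps if the server
# actually uses HTTP Basic authentication.
resp = requests.get(url, auth=('username', 'password'))
print(resp.status_code)
```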

Cheers,
Cameron Simpson cs@cskk.id.au

Thanks. Could you show me some ways to solve this?

By Coverdaisy via Discussions on Python.org at 14Sep2022 09:45:

Thanks. Could you show me some ways to solve this?

Not easily; this kind of thing is service specific. There are standard
ways to provide authentication, but which of them applies depends on the
service.

Note that buried in the response message is a suggestion that the
rejection was based on a blacklist (“denied by UA ACL = blacklist”),
which may mean that your IP address is forbidden from accessing this
URL. No amount of authentication will help you there, if that check
precedes the other checks.
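Since “UA” usually stands for User-Agent, the blacklist may instead be keyed on the User-Agent string. Note that in the original code the headers were passed positionally, so requests treated them as the `params` argument and sent its own default `python-requests/...` User-Agent, which such blacklists commonly reject. A sketch of sending the header the way requests expects it (a dict, via the `headers=` keyword):

```
import requests
from fake_useragent import UserAgent

url = 'https://www.aladdin-e.com/zh_cn/chemicals-and-biochemicals/bioscience/biological-buffers.html'

# headers must be a dict passed via the headers= keyword;
# requests.get(url, headers) passes it as the params argument instead.
headers = {'User-Agent': UserAgent().random}
resp = requests.get(url, headers=headers)
print(resp.status_code)
```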

However, you can test the basics from a command line. I just tried it
from here (in Australia) with your URL, and did not get a 403 error:

```
wget -S -O - 'https://www.aladdin-e.com/zh_cn/chemicals-and-biochemicals/bioscience/biological-buffers.html'
--2022-09-15 08:36:12--  https://www.aladdin-e.com/zh_cn/chemicals-and-biochemicals/bioscience/biological-buffers.html
Resolving www.aladdin-e.com (www.aladdin-e.com)... 203.107.45.179
Connecting to www.aladdin-e.com (www.aladdin-e.com)|203.107.45.179|:443... connected.
HTTP request sent, awaiting response...
   HTTP/1.1 200 OK
   Content-Type: text/html; charset=UTF-8
   Transfer-Encoding: chunked
   Connection: keep-alive
   Set-Cookie: aliyungf_tc=99fa7455478a2010e091a42868596289688ef11f4d2e52d32b4fe77c2ff737f4; Path=/; HttpOnly
   Set-Cookie: acw_tc=76b20f4e16631949741671755e52272bbd80e08c85528fbe66b753cca92661;path=/;HttpOnly;Max-Age=1800
   Server: nginx/1.14.1
   Vary: Accept-Encoding
   X-Powered-By: PHP/7.4.23
   Set-Cookie: PHPSESSID=pddcgkrucos14hn6t03f3biqn1; expires=Thu, 15-Sep-2022 02:36:14 GMT; Max-Age=14400; path=/; domain=www.aladdin-e.com; secure; HttpOnly; SameSite=Lax
   Pragma: no-cache
   Cache-Control: max-age=0, must-revalidate, no-cache, no-store
   Expires: Tue, 14 Sep 2021 13:46:57 GMT
   X-Content-Type-Options: nosniff
   X-XSS-Protection: 1; mode=block
   X-Frame-Options: ALLOWALL
   X-Cache: MISS
   Strict-Transport-Security: max-age=31536000
Length: unspecified [text/html]
Saving to: 'STDOUT'
<!doctype html>
<html lang="zh">
     <head >
         <script>
     var BASE_URL = 'https\u003A\u002F\u002Fwww.aladdin\u002De.com\u002Fzh_cn\u002F';
     var require = {
         'baseUrl': 'https\u003A\u002F\u002Fwww.aladdin\u002De.com\u002Fstatic\u002Ffrontend\u002FAladdin\u002Fcn\u002Fzh_Hans_CN'
     };</script>        <meta charset="utf-8"/>
<meta name="title" content="生物缓冲液-生化试剂-生物科学-阿拉丁aladdin"/>
<meta name="description" content="阿拉丁专业提供生物缓冲液,阿拉丁试剂网提供试剂、耗材一站式服务!"/>
<meta name="keywords" content="生物缓冲液,阿拉丁,阿拉丁试剂网"/>
<meta name="robots" content="INDEX,FOLLOW"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
<meta name="format-detection" content="telephone=no"/>
<meta name="renderer" content="webkit"/>
<meta name="" content="IE=edge,chrome=1"/>
<title>生物缓冲液-生化试剂-生物科学-阿拉丁aladdin</title>
```
and so on…

So it is possible that the blacklist is a local thing: does your HTTP
access go via a local proxy where you are?

Cheers,
Cameron Simpson cs@cskk.id.au

I don’t know; how can I find out? Sorry to bother you.

I use the debug console in my browser to see the network traffic and capture the set of requests and responses when I manually log in and access a resource.

Then I turn that set of requests into code.

For example, there may be a login form.
I would find out what the login form fields are and POST the fields, in the right format, with the values filled in, from my Python code.
Take the cookie (it’s usually a cookie) that comes back with the response and add that cookie to all later requests.
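A sketch of that pattern with requests.Session, which carries cookies across requests automatically (the login URL and form field names here are hypothetical; read the real ones from the browser’s network tab):

```
import requests

# Hypothetical URL and field names -- copy the real ones from
# the request your browser sends when you log in manually.
LOGIN_URL = 'https://www.example.com/login'

with requests.Session() as session:
    resp = session.post(LOGIN_URL, data={
        'username': 'me@example.com',   # hypothetical form fields
        'password': 'secret',
    })
    resp.raise_for_status()

    # The session now holds the login cookie and sends it
    # with every later request automatically.
    page = session.get('https://www.example.com/protected/page.html')
    print(page.status_code)
```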

Barry

By Coverdaisy via Discussions on Python.org at 15Sep2022 09:44:

I don’t know; how can I find out? Sorry to bother you.

Can you fetch that URL from a command prompt, eg:

 wget -S -O - 'https://www.aladdin-e.com/zh_cn/chemicals-and-biochemicals/bioscience/biological-buffers.html'

That can be quite informative for a basic test. I like wget, but
curl is also very popular.
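The same basic check can also be done from Python; a minimal sketch that prints the status line and the response headers, much like `wget -S` does:

```
import requests

url = ('https://www.aladdin-e.com/zh_cn/chemicals-and-biochemicals/'
       'bioscience/biological-buffers.html')

resp = requests.get(url)
print(resp.status_code, resp.reason)   # e.g. "200 OK" or "403 Forbidden"
for name, value in resp.headers.items():
    print(f'{name}: {value}')          # roughly what wget -S shows
```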

Cheers,
Cameron Simpson cs@cskk.id.au