Locating elements in a web page

https://pastebin.com/raw/wBkP02yG

Hey everyone, I have created a script to grab the names of malls from the specified URL. The only problem I'm facing is going to the next page… kindly help me with how I can navigate to the next page. If there is another way to get to the next page, feel free to suggest it.

This sounds like web-scraping. Most sites do not allow web-scraping. Have you confirmed you are allowed to do this?

Hey Hans, yes I am web scraping, and I am allowed. I've scraped the first 10 malls on the first page… it's only that I don't know how to locate that next button.

Ok - Can you provide a reference for the permission?
So, assuming you wrote the pastebin code yourself, I do not really understand how it is that you have this question. Even if you did not write the code, it should still not be a problem to figure out a way, since the Dubai map website seems to be using a very consistent way of constructing URLs…?

Hey Hans, I wrote it myself… I know it may seem trivial, maybe especially to you, but I can't get my head around it. If you look at the code, the commented-out lines are the trials I was attempting; some are XPath or CSS selectors. I can't see the consistent way of constructing URLs that you're seeing, and that's what is challenging me at the moment: locating those elements, and it sucks… so if you can assist, it would be nice.

Can you confirm that it’s legal, first?

I think this is the official bulletin board, so nobody can, or should, answer anything that violates the law.
Although I am not interested in scraping at the moment, I knew that Python is substantial not only in the AI field but also as an environment for scraping, so the reaction here was a surprise to me. Still, I think this is a more suitable question for informal places such as Reddit.
It seems like a problem that anyone who does scraping would solve on their own, but for what it's worth: I like to extend Chrome with uAutoPagerize, AutoPagerize, and other extensions that automatically load the "next page". They seem to maintain support on a per-site basis, so there may be hints in their source. I have confirmed that the source of the AutoPagerize fork is available on the net.

You should be able to find the links that you want your scraper to follow from the landing page. For example, the website may have the likes of <a class="btn" href="https://www.example.com/page2" title="This is page 2"> Next Page ❯ </a>.

You just have to figure out how the site works and code your scraper accordingly, as every site will (as likely as not) use a different format, but it should all be there: if you can see it in your web browser, you can find it in the HTML source and have your scraper follow whatever links you want it to follow.
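
As a minimal sketch, assuming the page source contains that anchor and that the button's class really is btn:

from bs4 import BeautifulSoup

# the example anchor from above, as it might appear in the page source
html = '<a class="btn" href="https://www.example.com/page2" title="This is page 2"> Next Page ❯ </a>'
soup = BeautifulSoup(html, 'html.parser')

next_link = soup.find('a', class_='btn')  # first anchor with class "btn"
if next_link is not None:
    print(next_link['href'])  # https://www.example.com/page2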

Hey guys, so I was consulting with somebody the other day and he gave me a nice solution. A logical solution, as opposed to locating the next-page button… the idea was to count how many products are being rendered per page, because there is an element (a label) near the top of the page showing how many malls have been found… and I was able to see the pattern in the URL to get to the next page…

looping through all pages

import random
import time

# base_url, coordinates and total_pages are defined earlier in the script;
# total_pages is worked out from the "malls found" label near the top of the page
for page_number in range(1, total_pages + 1):
    # sleep for a random duration between pages, to keep the request rate polite
    sleep_duration = random.uniform(2, 5)
    time.sleep(sleep_duration)

    # construct the page URL from the pattern spotted in the address bar
    next_page = f"{base_url}/page/{page_number}/{coordinates}"
    # next_page = f'https://2gis.ae/dubai/search/Malls/page/{page_number}?m=55.199034%2C25.118495%2F11'
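
For reference, a minimal sketch of how total_pages might be worked out from that label; the label text and the per-page count here are made-up values:

import math

RESULTS_PER_PAGE = 10  # assumption: the site renders 10 malls per page
label_text = "124 malls found"  # hypothetical text taken from the results label
total_malls = int(label_text.split()[0])
total_pages = math.ceil(total_malls / RESULTS_PER_PAGE)
print(total_pages)  # 13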

Hey Rob, yeah, figuring out how the site works was where I couldn't get my head around it… because I thought it would be easier to locate the next button, but how to get to it was the challenge. I know it might seem trivial, but at times we see things differently; some can see patterns quickly, lol, or a much better way of getting things done… anyway, I got it working.

You’ve coded your scraper a little differently to the way that I would have, but then almost every scraper needs to be tailored to the site on which it will be used, so I’m guessing that you had your reasons for doing things the way that you have: it’s all about extracting as much intel from the HTML source code as you can and then using that intel to achieve your goal.

Hey, yeah, my way is always to find a working solution first; then I’ll figure out improvements much later, or somebody might suggest another way when they check out my code… umm, if it’s not too much of a hassle and you have time, could you share how you would extract, say, the names? Because I was told I should sometimes check first whether there’s an API… suggestions are great, and you guys have a way of looking at things in a much better way.

Indeed: if there’s an API, then happy days; use it. With sites that do not provide an API, I tend to craft my scraper in a way that mimics (as closely as is practical) the way that a web browser works, insofar as I include request header information (in the same way that a regular web browser does).

I also build in delays, so that I don’t generate a tonne of requests, which could cause other users of the target site to experience performance issues or even a DoS; just be respectful about the load that you’re placing on the target web server.
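
Something along these lines, for example; polite_get is just an illustrative name, and the delay bounds are arbitrary:

import random
import time

import requests

def polite_get(url, headers=None, min_delay=2.0, max_delay=5.0):
    # pause for a random interval before each request, to spread out the load
    time.sleep(random.uniform(min_delay, max_delay))
    return requests.get(url, headers=headers)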

To add: I’m not going to go all-in here and post a complete working solution, more an example of a typical ‘get request’ build:

import requests
from bs4 import BeautifulSoup

headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0'
}

html_file = 'ws_output.html'
url = "" # whatever landing page you choose
print(f"Using: {url}")
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
print(f"Writing to : {html_file}")
with open(html_file, mode='w', encoding='UTF-8') as output:
    print(soup.prettify(), file=output)

exit("Finished.")

That’s typically my starting point. From there, I’ll inspect the html file that has been saved and craft my code to suit my needs.

Nice… aha, so it’s more like: make a request, get the response, then work with the response (the soup object).

Yes, that’s the idea: once we have our soup variable, we can do quite a lot with it. Here, I’ve used the .prettify() method so that I can get a handle on the HTML structure. Once we have that, the code can be further developed to target particular HTML elements.

One of the more straightforward ways is to use the soup.find() or soup.find_all() methods to search the DOM for certain elements, such as ‘anchor tags’, maybe of a particular ‘class’. Passing the class_ argument (note the underscore) with some ‘criteria’ allows us to filter by class name: soup.find_all("a", class_="criteria"), for example.
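
A tiny self-contained illustration of the difference between the two (the markup here is made up):

from bs4 import BeautifulSoup

html = '<a class="btn" href="/page/2">2</a><a class="btn" href="/page/3">3</a>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('a', class_='btn')['href'])   # first match only: /page/2
print(len(soup.find_all('a', class_='btn')))  # a list of every match: 2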

finding ratings

import time

# "content" is the soup object built from the results page
ratings_wrap = content.find_all('div', {'class': '_1emmab1'})
for rating in ratings_wrap:
    rating_text = rating.find('div', {'class': '_jspzdm'}).text
    rate_text = rating.find('div', {'class': '_y10azs'}).text
    time.sleep(1)
    print(rating_text)
    print(rate_text)

That’s what I’ve been doing… did you have a look at the code I posted earlier (the pastebin)? But the class_ caught my attention. Isn’t it the same as the way I’m finding the divs with the class values specified?

I had a very brief look at your code, so I’ve no real idea about how it works. I simply noted that it’s not how I would have coded a scraper, but I’m not saying that it’s wrong; it’s just not how I would do it, is all. For example, I’ve no clue about why you have the braces, as in .find('div', {'class': '_jspzdm'}).text, or why there’s a .text at the end, but if it’s working, then who am I to tear it apart: I’m no expert.

Oh lol… if it works, don’t touch it, lol… anyway.
The idea behind the curly braces is just a dictionary defining the attrs associated with an element (in our case, the ‘div’), so it searches for a div with a class (the key) where the value is ‘_jspzdm’. Then .text is just there to get the text part of it, in this case ‘246 ratings’.
<div class="_jspzdm">246 ratings</div>
Does it make sense?
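
For example, running that lookup on just this snippet prints the text content:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="_jspzdm">246 ratings</div>', 'html.parser')
print(soup.find('div', {'class': '_jspzdm'}).text)  # 246 ratings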

Ah, I see what you’re doing now, ty.

Possibly, BeautifulSoup is doing something akin to that, under the hood, so to speak, with the class_ argument.
¯\_(ツ)_/¯

@kyle

I’ve been having a little play with this, and this code seems to do what your code (above) does:

# every element whose class list contains "_jspzdm", regardless of tag name
jspzdm_list = soup.find_all(class_="_jspzdm")
for item in jspzdm_list:
    print(item.text)      # the element's text, e.g. "246 ratings"
    print(item.previous)  # the node parsed immediately before the element

footnote: I think you’ve used content, whereas I’ve used soup