I need to generate a property list of multifamily properties in the St. Louis, Missouri area: a complete list of all duplex, triplex, fourplex, and 5+ unit multifamily properties within the St. Louis MSA. The data I need is property address, square footage, unit count, owner name, owner address, owner phone number, and owner email. I decided to scrape several websites for this information (Realtor.com, Zillow.com, etc.). I am using Python in the PyCharm IDE with BeautifulSoup. I am stuck on pagination: I get a 308 error when reading from the URL I form dynamically. Here is my code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors. Note: the context must actually be
# passed to urlopen() (context=ctx), or creating it has no effect.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter Url- ')
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, context=ctx).read()
soup = BeautifulSoup(webpage, 'html.parser')

for link in soup.findAll('a'):
    try:
        if 'realestateandhomes-detail' in str(link.get('href')):
            # Build the detail-page URL from the search URL. Use a local
            # variable so the loop does not overwrite `url` on each pass.
            base = str(url).replace('realestateandhomes-search', '')
            newURL = base[0:len(base) - 13] + link.get('href')
            print(newURL)
            print('1')
            req1 = Request(newURL, headers={'User-Agent': 'Mozilla/5.0'})
            print('2')
            webpage1 = urlopen(req1, context=ctx).read()
            print('3')
            soup1 = BeautifulSoup(webpage1, 'html.parser')
            for x in soup1.findAll('bed'):
                print(str(x))
    except Exception as e:
        print(str(e))
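As an aside, the string slicing I use to build newURL (a replace plus cutting off the last 13 characters) is fragile. urllib.parse.urljoin resolves an href against the page URL and handles absolute paths for you. A small sketch, with a made-up listing path:

```python
from urllib.parse import urljoin

# Example search URL and a hypothetical listing href; urljoin replaces the
# path component of the base URL when the href starts with '/'.
base = "https://www.realtor.com/realestateandhomes-search/Saint-Louis_MO"
href = "/realestateandhomes-detail/123-Example-St_Saint-Louis_MO_63101"

detail_url = urljoin(base, href)
print(detail_url)
# → https://www.realtor.com/realestateandhomes-detail/123-Example-St_Saint-Louis_MO_63101
```

This way the detail URL stays correct even if the search URL gains query parameters or a trailing slash.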
When I run this script, it asks for a URL; I enter the realtor.com search URL for “St. Louis, MO Real Estate - St. Louis Homes for Sale | realtor.com®”. The script then fails every time it reaches urlopen. I included those numbered print statements for debugging.
The for loop forms the pagination URL dynamically. Requesting the formed URL in a browser works fine, but it returns a 308 error when I request it from Python. Is this because of security restrictions on the website, or am I missing something?
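For what it's worth, 308 is a Permanent Redirect, not a security block: the server is pointing at a slightly different URL (often the same path with or without a trailing slash, or http vs. https). urllib's default redirect handler only follows 308 automatically on newer Python versions (3.11+, if I recall correctly), so on older versions it surfaces as an HTTPError; the requests library follows 308 out of the box. A minimal sketch of following it manually; fetch_following_308 and the opener callable are hypothetical names, not anything specific to realtor.com:

```python
from urllib.error import HTTPError

def fetch_following_308(url, opener, max_hops=3):
    """Open `url`, manually following 308 Permanent Redirects.

    `opener` is any callable that takes a URL and returns a response,
    e.g. a wrapper around urlopen() that sets the browser-like headers.
    """
    for _ in range(max_hops):
        try:
            return opener(url)
        except HTTPError as e:
            # 308 means: repeat the same request at the URL given in the
            # Location header.
            if e.code == 308 and e.headers and "Location" in e.headers:
                url = e.headers["Location"]
            else:
                raise
    raise RuntimeError("too many redirects")
```

You could also compare the Location header against the URL you built; in my experience the diff usually shows exactly what is wrong with the dynamically formed URL.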
I tried the suggestions in these links, but they were not much help:
Scraping realtor data with beautifulsoup
How to scrape page with pagination with python BeautifulSoup