Python: Extract Data from an Interactive Map on the Web, with Several Years, into a CSV file

Hi Python Community,

I hope you are doing well. I have quite a challenging task (at least for me, as I am a novice in Python).

I need help extracting a large amount of data from an interactive map on a website and saving it to a .csv file.

The map in question is here.

  1. I would like to know if there is a way to extract all the municipalities’ data, please?

I thank you in advance for your help.
Best wishes,

Michael

Edit: Here is what I tried, in vain, to extract the data into a Python data frame. I know that the data I want is there:

{
    "type": "MultiPolygon",
    "arcs": [[[-15410, 15526, -15491, -15236, -15379]]],
    "properties": {
        "NAMEUNIT": "<strong>Municipio: Villabrázaro\u003c/strong><br/>",
        "Unitario": "Precio unitario medio : 564 Euros/m<sup>2\u003c/sup><br/>",
        "Precio unitario medio del municipio": 564,
        "desviacion": "Desviacion tipica: 369<br/>",
        "superficie": "Superficie media: 311 m<sup>2\u003c/sup><br/>",
        "moda": "Rango de precio mas frecuente: 200-400 Euros/m<sup>2\u003c/sup><br/>",
        "poblacion": "Poblacion: 239<br/>",
        "renta_persona": "Renta media por persona: 13487 Euros/año<br/>",
        "renta_hogar": "Renta media por hogar : 31180 Euros/año<br/>"
    }
},
{
    "type": "MultiPolygon",
    "arcs": [[[-15348, -15345, -15267, -13840, -13292]]],
    "properties": {
        "NAMEUNIT": "<strong>Municipio: Villaescusa\u003c/strong><br/>",
        "Unitario": "Precio unitario medio : 580 Euros/m<sup>2\u003c/sup><br/>",
        "Precio unitario medio del municipio": 580,
        "desviacion": "Desviacion tipica: 660<br/>",
        "superficie": "Superficie media: 242 m<sup>2\u003c/sup><br/>",
        "moda": "Rango de precio mas frecuente: 100-200 Euros/m<sup>2\u003c/sup><br/>",
        "poblacion": "Poblacion: 235<br/>",
        "renta_persona": "Renta media por persona: 11123 Euros/año<br/>",
        "renta_hogar": "Renta media por hogar : 21876 Euros/año<br/>"
    }
},
and so on...

I tried the following chunk, but obtained an error that I cannot solve:

import requests
import pandas as pd

# URL of the web page containing the data
url = "https://www.cohispania.com/wp-content/uploads/2024/01/mapa-2023.html"

# Fetch the HTML content
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Extract JSON data from HTML content
    json_data_list = response.json()
    
    # List to store extracted data for each municipality
    extracted_data = []

    # Iterate over each JSON object representing a municipality
    for json_data in json_data_list:
        # Extract relevant information from the JSON data
        properties = json_data["properties"]
        name = properties["NAMEUNIT"].split(":")[1].strip().replace("</strong><br/>", "")
        precio_unitario_medio = properties["Precio unitario medio del municipio"]
        desviacion = properties["desviacion"].split(":")[1].strip().replace("<br/>", "")
        superficie = properties["superficie"].split(":")[1].strip().replace(" m<sup>2</sup><br/>", "")
        moda = properties["moda"].split(":")[1].strip().replace("</sup><br/>", "")
        poblacion = properties["poblacion"].split(":")[1].strip().replace("<br/>", "")
        renta_persona = properties["renta_persona"].split(":")[1].strip().replace(" Euros/año<br/>", "")
        renta_hogar = properties["renta_hogar"].split(":")[1].strip().replace(" Euros/año<br/>", "")

        # Append extracted data to the list
        extracted_data.append({
            "Name": name,
            "Precio unitario medio": precio_unitario_medio,
            "Desviacion": desviacion,
            "Superficie": superficie,
            "Moda": moda,
            "Poblacion": poblacion,
            "Renta media por persona": renta_persona,
            "Renta media por hogar": renta_hogar
        })

    # Create DataFrame from the extracted data
    df = pd.DataFrame(extracted_data)

    # Display the DataFrame
    print(df)
else:
    print("Failed to fetch data from the web page")

  • Could anyone please give me a hand with this?
  1. What error did you get?

  2. For the example input that you show, exactly what result do you want to get? What are the rules that the code should use, to calculate that?

Hi @kknechtel ,

Thank you for reaching out. Below are the answers to your points. Thank you for your willingness to help:

  1. Here are the errors obtained:
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
File ~\AppData\Local\anaconda3\Lib\site-packages\requests\models.py:971, in Response.json(self, **kwargs)
   970 try:
--> 971     return complexjson.loads(self.text, **kwargs)
   972 except JSONDecodeError as e:
   973     # Catch JSON-related errors and raise as requests.JSONDecodeError
   974     # This aliases json.JSONDecodeError and simplejson.JSONDecodeError

File ~\AppData\Local\anaconda3\Lib\json\__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
   343 if (cls is None and object_hook is None and
   344         parse_int is None and parse_float is None and
   345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
   347 if cls is None:

File ~\AppData\Local\anaconda3\Lib\json\decoder.py:337, in JSONDecoder.decode(self, s, _w)
   333 """Return the Python representation of ``s`` (a ``str`` instance
   334 containing a JSON document).
   335 
   336 """
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
   338 end = _w(s, end).end()

File ~\AppData\Local\anaconda3\Lib\json\decoder.py:355, in JSONDecoder.raw_decode(self, s, idx)
   354 except StopIteration as err:
--> 355     raise JSONDecodeError("Expecting value", s, err.value) from None
   356 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

JSONDecodeError                           Traceback (most recent call last)
Cell In[17], line 13
    10 # Check if the request was successful
    11 if response.status_code == 200:
    12     # Extract JSON data from HTML content
---> 13     json_data_list = response.json()
    15     # List to store extracted data for each municipality
    16     extracted_data = []

File ~\AppData\Local\anaconda3\Lib\site-packages\requests\models.py:975, in Response.json(self, **kwargs)
   971     return complexjson.loads(self.text, **kwargs)
   972 except JSONDecodeError as e:
   973     # Catch JSON-related errors and raise as requests.JSONDecodeError
   974     # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
--> 975     raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
  2. I would like to obtain a data frame in Python, in which each column represents one variable:
    • (so NAMEUNIT for the first column, Unitario for the second one, and so on).

    • For each row, I would like to extract the data from it: Villabrázaro for the first column, 564 Euros/m^2 for the second one, and so on, for all the municipalities available on the HTML.
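The cleaning rules described above (one column per key, with the HTML markup and the leading label removed) could be sketched roughly as follows. This is only an illustration under assumptions: the helper names `clean_value` and `properties_to_row` are made up here, the sample dict is a shortened copy of the Villabrázaro entry after JSON decoding, and rendering `<sup>2</sup>` as `^2` is a choice rather than a requirement.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches any HTML tag such as <br/> or </strong>

def clean_value(raw):
    """Strip HTML tags and the leading 'Label:' part from one field (hypothetical helper)."""
    if not isinstance(raw, str):
        return raw  # already a plain number, e.g. 564
    text = raw.replace("<sup>", "^")          # keep the exponent readable: m<sup>2 -> m^2
    text = html.unescape(TAG_RE.sub("", text)).strip()
    # Keep only what follows the first colon ("Poblacion: 239" -> "239").
    _, sep, value = text.partition(":")
    return value.strip() if sep else text

def properties_to_row(properties):
    """Turn one 'properties' dict into a flat row with the same keys."""
    return {key: clean_value(value) for key, value in properties.items()}

# Shortened sample taken from the question (after JSON decoding, \u003c becomes <).
sample = {
    "NAMEUNIT": "<strong>Municipio: Villabrázaro</strong><br/>",
    "Unitario": "Precio unitario medio : 564 Euros/m<sup>2</sup><br/>",
    "Precio unitario medio del municipio": 564,
    "desviacion": "Desviacion tipica: 369<br/>",
    "poblacion": "Poblacion: 239<br/>",
}
row = properties_to_row(sample)
print(row)
```

A list of such rows can be passed directly to `pd.DataFrame(...)`, as in the original attempt.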

I apologize, it is the first time I am trying to extract data from a map, so maybe I am not clear.
Thank you for your help!

Are you sure the response contains valid JSON? Maybe the service responds with HTML content (because there was an error)?

Hi @FelixLeg,

Thank you for reaching out.

You are right: I am not sure about the validity of the JSON. To be honest, I don’t know how I could check that. Sorry, I am still a novice.

The data that I want is always in this form, with changing names and values:

 "geometries": [
    {
        "type": "MultiPolygon",
        "arcs": [[[0, 1, 2]]],
        "properties": {
            "NAMEUNIT": "<strong>Municipio: Mélida\u003c/strong><br/>",
            "Unitario": "Precio unitario medio : 729 Euros/m<sup>2\u003c/sup><br/>",
            "Precio unitario medio del municipio": 729,
            "desviacion": "Desviacion tipica: 398<br/>",
            "superficie": "Superficie media: 190 m<sup>2\u003c/sup><br/>",
            "moda": "Rango de precio mas frecuente: 200-400 Euros/m<sup>2\u003c/sup><br/>",
            "poblacion": "Poblacion: 714<br/>",
            "renta_persona": "Renta media por persona: 14526 Euros/año<br/>",
            "renta_hogar": "Renta media por hogar : 34162 Euros/año<br/>"
        }
    },

Thank you for your help!

Well, I think the only way to test whether this is JSON is to call response.json() and catch any exceptions with a try: ... except: ... construction :slight_smile:

Anyway, try to load the “JSON” data manually (using your browser or a tool like wget or curl) and check the contents
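A minimal sketch of that try/except check, assuming the response body is already available as a string. The function name `try_parse_json` and the two sample strings are invented for illustration. With `requests`, the text would come from `response.text`, and `response.headers.get("Content-Type")` is another quick hint about what the server actually sent.

```python
import json

def try_parse_json(text):
    """Return (data, None) if text is valid JSON, else (None, error message)."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as err:
        return None, f"{err.msg} at line {err.lineno} column {err.colno}"

# Two made-up bodies: an HTML page and a small JSON document.
html_page = "<!DOCTYPE html><html><body>map</body></html>"
json_body = '{"type": "MultiPolygon", "arcs": [[[0, 1, 2]]]}'

print(try_parse_json(html_page))  # fails: the first character is "<", not a JSON value
print(try_parse_json(json_body))  # succeeds and returns the parsed dict
```

An "Expecting value" error at line 1, column 1 usually means the body does not start with JSON at all, which is what the traceback in this thread shows.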

From what you posted, it is almost JSON, but not quite. Your data has one issue: it is a list of JSON objects, not a single entry.

Do they always return JSON like this?

Yes, the data are always in this form, except for municipalities with missing data:

[
    "<strong>Municipio: Valderrebollo<\/strong><br/>Precio unitario medio: No disponible<br/>Desviacion tipica: No disponible<br/>Superficie media: No disponible<br/>Rango de precio mas frecuente:  No disponible<br/>Poblacion: 23<br/>Renta media por persona: 14898 Euros/año<br/>Renta media por hogar: 25639 Euros/año",
    "<strong>Municipio: Mirafuentes<\/strong><br/>Precio unitario medio: No disponible<br/>Desviacion tipica: No disponible<br/>Superficie media: No disponible<br/>Rango de precio mas frecuente:  No disponible<br/>Poblacion: 59<br/>Renta media por persona: 15444 Euros/año<br/>Renta media por hogar: 38614 Euros/año",
    "<strong>Municipio: Valdesotos<\/strong><br/>Precio unitario medio: No disponible<br/>Desviacion tipica: No disponible<br/>Superficie media: No disponible<br/>Rango de precio mas frecuente:  No disponible<br/>Poblacion: 27<br/>Renta media por persona: 14898 Euros/año<br/>Renta media por hogar: 25639 Euros/año",
...
],
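For these entries, where everything sits in one string separated by `<br/>`, a rough sketch of a parser might look like the following. The function name `parse_popup` is hypothetical, the sample is a shortened copy of the Valderrebollo string after JSON decoding, and mapping "No disponible" to `None` is an assumption about how missing values should be represented.

```python
import re

def parse_popup(popup):
    """Split one '<br/>'-separated popup string into a {label: value} dict."""
    record = {}
    for part in popup.split("<br/>"):
        text = re.sub(r"<[^>]+>", "", part).strip()  # drop tags such as <strong>
        if not text:
            continue
        label, sep, value = text.partition(":")
        if not sep:
            continue  # no "label: value" structure in this fragment
        value = value.strip()
        # Treat the site's "No disponible" marker as a missing value.
        record[label.strip()] = None if value == "No disponible" else value
    return record

popup = ("<strong>Municipio: Valderrebollo</strong><br/>"
         "Precio unitario medio: No disponible<br/>"
         "Poblacion: 23<br/>"
         "Renta media por persona: 14898 Euros/año")
print(parse_popup(popup))
```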

Can you give me an example URL of that service?

What do you mean? The full map is here:

leaflet (cohispania.com)

Sorry, I was typing too fast :blush: I wanted the URL from which you get the JSON data. I want to check the data on my own.

Ok, I understand.

Basically, the Leaflet map is at the URL provided in post #9. I then press Ctrl + U on Windows (or Option + Command + U on a Mac, if I am not mistaken) to view the full page source.

Thank you in advance for your help!

O…M…G… the page is so big that none of my editors wants to load it :astonished:

I think you are asking too much of response.json(). If the data is embedded inside an HTML file, then json() doesn’t know how to parse it. It is not a magical tool; it can’t extract JSON on its own. I think you will have to write quite a lot of code to extract the JSON yourself first. :frowning:

Hi again Przemysław,

  • So that’s why response.json() was giving me an error:

The data is embedded inside a HTML file then json() doesn’t know how to parse it

  • Their code is really vague. Could this be a strategy to dissuade people from owning their data?

  • Do you think extraction is still possible?

Thanks again for your help, time and generosity @FelixLeg!

I wish you luck; however, from what I was able to get from the site’s source, it may be hard. Whoever wrote the source of the site has done everything to make it impossible to extract any data :frowning: The code is obfuscated and very big.

@FelixLeg: Thank you again for your amazing help.

Probably. It is a consultancy firm in Spain, so I imagine that they made it difficult to extract data on purpose…

I wish you all the best. Again, thank you!

If anyone has a way around this problem, I’m more than happy to “hear” it.

Thank you!

Okay, I found a way to extract JSON from the site :smiley:

import urllib.request as req
import html.parser as hp

# Download the page and decode it to text
resp = req.urlopen("https://www.cohispania.com/wp-content/uploads/2024/01/mapa-2023.html")
html_cont = resp.read().decode('utf8')

class html_parser(hp.HTMLParser):
    """Collect the text found inside <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self.inside_data = False  # True while inside a matching <script> tag
        self.the_data = ""        # JSON text collected so far

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'script' and ('type', 'application/json') in attrs:
            self.inside_data = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self.inside_data = False

    def handle_data(self, data):
        # Text from every matching <script> tag is appended to the_data
        if self.inside_data:
            self.the_data += data

parser = html_parser()
parser.feed(html_cont)
parser.close()
the_json = parser.the_data

# optional: write to file (UTF-8 so accented characters are written the same way on any platform)
with open('result.json', 'w', encoding='utf-8') as f:
    f.write(the_json)

Now it is up to you what you want to do with the data :wink:
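One possible next step, sketched under assumptions: the exact nesting of the extracted JSON is not shown in this thread, so the walker below simply searches every level for `properties` dicts that contain a `NAMEUNIT` key, and also looks inside strings that themselves hold JSON. The names `iter_properties` and `write_csv`, the sample nesting, and the output file name `municipios.csv` are all hypothetical.

```python
import csv
import io
import json

def iter_properties(node):
    """Yield every 'properties' dict that has a NAMEUNIT key, at any depth."""
    if isinstance(node, dict):
        props = node.get("properties")
        if isinstance(props, dict) and "NAMEUNIT" in props:
            yield props
        for value in node.values():
            yield from iter_properties(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_properties(item)
    elif isinstance(node, str) and node.lstrip().startswith(("{", "[")):
        # Some pages store JSON as a string inside JSON; try to decode it.
        try:
            inner = json.loads(node)
        except json.JSONDecodeError:
            return
        yield from iter_properties(inner)

def write_csv(rows, out):
    """Write dict rows to an open text file, using the union of their keys as columns."""
    rows = list(rows)
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Made-up nesting for demonstration; the real file may be organised differently.
sample = {"x": {"calls": [{"args": [{"objects": {"layer": {"geometries": [
    {"type": "MultiPolygon",
     "properties": {"NAMEUNIT": "<strong>Municipio: Mélida</strong><br/>",
                    "Precio unitario medio del municipio": 729}},
    {"type": "MultiPolygon",
     "properties": {"NAMEUNIT": "<strong>Municipio: Villaescusa</strong><br/>",
                    "Precio unitario medio del municipio": 580}},
]}}}]}]}}

buffer = io.StringIO()
write_csv(iter_properties(sample), buffer)
print(buffer.getvalue())
```

For the real data, `sample` would be replaced by `json.load(...)` on result.json (opened with the same encoding it was written with), and `buffer` by a file opened with `open("municipios.csv", "w", newline="", encoding="utf-8")`. The values still contain the HTML markup at this point, so a cleaning step is needed before or after writing.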

This is completely wrong. First off, .json() doesn’t do that - it expects the entire data from the request to be a single chunk of JSON data. You can’t use it for a web page because the data is in a completely different format. Second, the HTML content in this case doesn’t contain the JSON you want anyway. Instead, it contains JavaScript code that your browser uses to make AJAX calls to get the JSON data.

Please read:

Hi @FelixLeg,

I just want to thank you so much for your help in this! It works perfectly well.

I have learned a lot with you in this thread! Thanks for sharing your knowledge.

Hope to “hear” from you again in one of my future posts.

All the best,

Michael

Hi @kknechtel,

Thank you for your clarifications about what I can or cannot do with HTML and JavaScript.

Sorry for my mistakes, I thought it was possible the way I presented it.

I also learned more from you! Thank you very much for helping me become a little more knowledgeable about HTML. Again, thank you so much.

All the best,