How to transform a JSON file into a CSV one in Python?

Hi Everyone,

I have a quick question. Thanks to a previous post :

Python: Extract Data from an Interactive Map on the Web, with Several Years, into a CSV file

I was been able to extract JSON data from the web. Thanks again to @FelixLeg & @kknechtel to their useful help and advices. This could not be possible without their insightful comments. Many thanks to them again.

But now, I would like to extract those data into a CSV file, without taking into account the specific HTML jargon (for example <br> / </br>, etc.) that I have in the original JSON file.

I would like to have for each column the variable name (Precio unitario medio, Desviacion tipica, Superficie media, Rango de precios frequente, Poblacion, Renta media por persona and Renta media por hogar). As in the original map.

I tried the following chunk with pandas library, but obtain a systematic error:

import pandas as pd

with open('result.json', encoding='utf-8') as inputfile:
    df = pd.read_json(inputfile)

df.to_csv('csvfile.csv', encoding='utf-8', index=False)

Here is the error I obtained:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[20], line 4
      1 import pandas as pd
      3 with open('result.json', encoding='utf-8') as inputfile:
----> 4     df = pd.read_json(inputfile)
      6 df.to_csv('csvfile.csv', encoding='utf-8', index=False)

File ~\AppData\Local\anaconda3\Lib\site-packages\pandas\io\json\_json.py:760, in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, precise_float, date_unit, encoding, encoding_errors, lines, chunksize, compression, nrows, storage_options, dtype_backend, engine)
    757 if convert_axes is None and orient != "table":
    758     convert_axes = True
--> 760 json_reader = JsonReader(
    761     path_or_buf,
    762     orient=orient,
    763     typ=typ,
    764     dtype=dtype,
    765     convert_axes=convert_axes,
    766     convert_dates=convert_dates,
    767     keep_default_dates=keep_default_dates,
    768     precise_float=precise_float,
    769     date_unit=date_unit,
    770     encoding=encoding,
    771     lines=lines,
    772     chunksize=chunksize,
    773     compression=compression,
    774     nrows=nrows,
    775     storage_options=storage_options,
    776     encoding_errors=encoding_errors,
    777     dtype_backend=dtype_backend,
    778     engine=engine,
    779 )
    781 if chunksize:
    782     return json_reader

File ~\AppData\Local\anaconda3\Lib\site-packages\pandas\io\json\_json.py:862, in JsonReader.__init__(self, filepath_or_buffer, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, precise_float, date_unit, encoding, lines, chunksize, compression, nrows, storage_options, encoding_errors, dtype_backend, engine)
    860 elif self.engine == "ujson":
    861     data = self._get_data_from_filepath(filepath_or_buffer)
--> 862     self.data = self._preprocess_data(data)

File ~\AppData\Local\anaconda3\Lib\site-packages\pandas\io\json\_json.py:874, in JsonReader._preprocess_data(self, data)
    872 if hasattr(data, "read") and not (self.chunksize or self.nrows):
    873     with self:
--> 874         data = data.read()
    875 if not hasattr(data, "read") and (self.chunksize or self.nrows):
    876     data = StringIO(data)

File <frozen codecs>:322, in decode(self, input, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 552: invalid continuation byte

Does anyone has an idea how can I circumvent this problem, please?

Thank you very much in advance!

There is no generalised way to convert arbitrary JSON into a CSV format.
You will need to do the conversion in your code.

2 Likes

Thank you very much for your help @barry-scott!

Oh really, as I am a beginner in this topic, do you have any suggestion to ease the procedure?
Basically I need this CSV file for further use in statistical softwares, like R, stata and Python.

Thank you very much.

Michael

I succeed to extract the original “raw” JSON, but I would like to polish the .csv

Here is my code:

import pandas as pd
from pandas import json_normalize
import json

def flatten_json(json_data):
    flattened_data = json_normalize(json_data)
    return flattened_data

def json_to_csv(json_file, csv_file):
    # Read JSON file
    with open(json_file, 'r') as f:
        json_data = json.load(f)

    # Flatten JSON data
    flattened_data = flatten_json(json_data)
    
    # Write flattened data to CSV
    flattened_data.to_csv(csv_file, index=False)

# Example usage
json_file = 'result.json'
csv_file = 'output.csv'
json_to_csv(json_file, csv_file)

Sorry if my code contains some errors.
Thank you again for your help.

Michael

Edit: Here is a screenshot of the raw file I obtained, in .csv format:

Please keep in mind that CSV can only represent a flat, 2d table of string values. JSON can represent any arbitrarily nested dictionaries and lists. You need to examine the JSON and understand how it is structured, to even decide whether the task is possible and makes sense; if it is, then you can start to figure out the logical rules for how to take the JSON apart and decide the cell values. If you can make a list (each row) of lists (columns in the row) of strings, then you can save CSV.

Hi @kknechtel,

Thank you for your feedback! After inspecting the JSON, the data that I want is always nested to “properties” name:

"properties":{
"NAMEUNIT":"<strong>Municipio: Mendigorr�a\u003c/strong><br/>",
"Unitario":"Precio unitario medio : 1223 Euros/m<sup>2\u003c/sup><br/>",
"Precio unitario medio del municipio":1223,
"desviacion":"Desviacion tipica: 976<br/>",
"superficie":"Superficie media: 175 m<sup>2\u003c/sup><br/>",
"moda":"Rango de precio mas frecuente: 200-400 Euros/m<sup>2\u003c/sup><br/>",
"poblacion":"Poblacion: 1124<br/>",
"renta_persona":"Renta media por persona: 13554 Euros/a�o<br/>",
"renta_hogar":"Renta media por hogar : 31644 Euros/a�o<br/>"
} ...

Except data with some missing values, which are here:

"na_municipios",{"interactive":true,"className":"","stroke":true,"color":"steelblue","weight":0.1,"opacity":1,"fill":true,"fillColor":"#B0B0B0","fillOpacity":1,"dashArray":"3","smoothFactor":1,"noClip":false},null,null,["<strong>Municipio: Valderrebollo<\/strong><br/>Precio unitario medio: No disponible<br/>Desviacion tipica: No disponible<br/>Superficie media: No disponible<br/>Rango de precio mas frecuente:  No disponible<br/>Poblacion: 23<br/>Renta media por persona: 14898 Euros/a�o<br/>Renta media por hogar: 25639 Euros/a�o", ...
]

Is it possible to extract data as a CSV file then? Thank you again for your help.

:frowning: I also tried this, in vein:

import json
import csv

# Read the JSON file
with open('result.json', 'r') as f:
    json_data = f.read()

# Parse the JSON data
data = json.loads(json_data)

# Open the CSV file in append mode
with open('output.csv', 'a', newline='') as csvfile:
    # Define the CSV writer
    writer = csv.DictWriter(csvfile, fieldnames=["Municipio", "Precio_unitario_medio", "Desviacion_tipica", "Superficie_media", "Rango_precio_mas_frecuente", "Poblacion", "Renta_media_persona", "Renta_media_hogar"])
    
    # Write header if the file is empty
    if csvfile.tell() == 0:
        writer.writeheader()
    
    # Iterate over each entry in the JSON data
    for entry in data:
        # Check if 'properties' key exists in the entry
        if 'properties' in entry:
            properties = entry['properties']
            
            # Extract relevant data
            municipio = properties.get('NAMEUNIT', '').split(': ')[1].split('<')[0].strip()
            precio_unitario_medio = properties.get('Precio unitario medio del municipio', '')
            desviacion_tipica = properties.get('desviacion', '').split(': ')[1].split('<')[0].strip()
            superficie_media = properties.get('superficie', '').split(': ')[1].split(' ')[0].strip()
            rango_precio_mas_frecuente = properties.get('moda', '').split(': ')[1].split('<')[0].strip()
            poblacion = properties.get('poblacion', '').split(': ')[1].split('<')[0].strip()
            renta_media_persona = properties.get('renta_persona', '').split(': ')[1].split(' ')[0].strip()
            renta_media_hogar = properties.get('renta_hogar', '').split(': ')[1].split(' ')[0].strip()
            
            # Write to CSV file
            writer.writerow({
                "Municipio": municipio,
                "Precio_unitario_medio": precio_unitario_medio,
                "Desviacion_tipica": desviacion_tipica,
                "Superficie_media": superficie_media,
                "Rango_precio_mas_frecuente": rango_precio_mas_frecuente,
                "Poblacion": poblacion,
                "Renta_media_persona": renta_media_persona,
                "Renta_media_hogar": renta_media_hogar
            })

It gives me an empty CSV file.

To debug your code add print() calls to show where the code is running and what interesting variables contain.

For example I would print('data:', data) before the for entry in data: loop and print('entry:', entry) at the start of the loop.

1 Like

Thank you so much @barry-scott !

I have tried your suggestion. Basically, it gives me the whole .json again (or this is my impression):
My code was the following:

import json
import csv

# Read the JSON file
with open('result.json', 'r') as f:
    json_data = f.read()

# Parse the JSON data
data = json.loads(json_data)

# Open the CSV file in append mode
with open('output.csv', 'a', newline='') as csvfile:
    # Define the CSV writer
    writer = csv.DictWriter(csvfile, fieldnames=["Municipio", "Precio_unitario_medio", "Desviacion_tipica", "Superficie_media", "Rango_precio_mas_frecuente", "Poblacion", "Renta_media_persona", "Renta_media_hogar"])
    
    # Write header if the file is empty
    if csvfile.tell() == 0:
        writer.writeheader()
    
    # Iterate over each entry in the JSON data
    print('data:', data)
    for entry in data:
        print('entry:', entry)
        # Check if 'properties' key exists in the entry
        if 'properties' in entry:
            properties = entry['properties']
            
            # Extract relevant data
            municipio = properties.get('NAMEUNIT', '').split(': ')[1].split('<')[0].strip()
            precio_unitario_medio = properties.get('Precio unitario medio del municipio', '')
            desviacion_tipica = properties.get('desviacion', '').split(': ')[1].split('<')[0].strip()
            superficie_media = properties.get('superficie', '').split(': ')[1].split(' ')[0].strip()
            rango_precio_mas_frecuente = properties.get('moda', '').split(': ')[1].split('<')[0].strip()
            poblacion = properties.get('poblacion', '').split(': ')[1].split('<')[0].strip()
            renta_media_persona = properties.get('renta_persona', '').split(': ')[1].split(' ')[0].strip()
            renta_media_hogar = properties.get('renta_hogar', '').split(': ')[1].split(' ')[0].strip()
            
            # Write to CSV file
            writer.writerow({
                "Municipio": municipio,
                "Precio_unitario_medio": precio_unitario_medio,
                "Desviacion_tipica": desviacion_tipica,
                "Superficie_media": superficie_media,
                "Rango_precio_mas_frecuente": rango_precio_mas_frecuente,
                "Poblacion": poblacion,
                "Renta_media_persona": renta_media_persona,
                "Renta_media_hogar": renta_media_hogar
            })

And here a small chunk that I obtained through print() as you suggested. Thank you again so much for your help!

data: {'x': {'options': {'minZoom': 5, 'maxZoom': 10, 'crs': {'crsClass': 'L.CRS.EPSG3857', 'code': None, 'proj4def': None, 'projectedBounds': None, 'options': {}}, 'fullscreenControl': {'position': 'topleft', 'pseudoFullscreen': False}}, 'calls': [{'method': 'addProviderTiles', 'args': ['CartoDB.Positron', None, None, {'errorTileUrl': '', 'noWrap': False, 'detectRetina': False}]}, {'method': 'addGeoJSONChoropleth', 'args': [{'type': 'Topology', 'objects': {'foo': {'type': 'GeometryCollection', 'geometries': [{'type': 'MultiPolygon', 'arcs': [[[0, 1, 2]]], 'properties': {'NAMEUNIT': '<strong>Municipio: Mélida</strong><br/>', 'Unitario': 'Precio unitario medio : 729 Euros/m<sup>2</sup><br/>', 'Precio unitario medio del municipio': 729, 'desviacion': 'Desviacion tipica: 398<br/>', 'superficie': 'Superficie media: 190 m<sup>2</sup><br/>', 'moda': 'Rango de precio mas frecuente: 200-400 Euros/m<sup>2</sup><br/>', 'poblacion': 'Poblacion: 714<br/>', 'renta_persona': 'Renta media por persona: 14526 Euros/año<br/>', 'renta_hogar': 'Renta media por hogar : 34162 Euros/año<br/>'}}, {'type': 'MultiPolygon', 'arcs': [[[3, 4, 5, 6, 7, 8, 9, 10, 11]]], ...

My CSV contains again only the headers that I created, so Municipio, Precio_unitario_medio, Desviacion_tipica, etc.

I have discovered something new. My JSON file is broken into dictionaries, right?

import json

# JSON --> result.json
with open('result.json', 'r') as f:
    data = json.load(f)

# keys
print(data.keys())

I obtained this:

dict_keys(['x', 'evals', 'jsHooks'])

The data I am interested in is located in the x dictionnary:

import json

# Charger le JSON depuis le fichier result.json
with open('result.json', 'r') as f:
    data = json.load(f)

# Extract "x" data
x_data = data['x']
evals_data = data['evals']
js_hooks_data = data['jsHooks']

# Show data
print("Data under "x" key :", x_data)

Apparently, when I try to extract data through Power Query, it is located under:

geometries1[properties]

in which I have:

NAMEUNIT
Unitario
Precio unitario medio del municipio
desviacion
superficie
moda
poblacion
renta_persona
renta_hogar

There is a useful function pprint that will format an object nicely.
This will provide a pretty version of the JSON for you.
It will then be easier to see the structure of the JSON data.

from pprint import pprint

pprint(data)
1 Like

Did you send me a direct message and forget to reply here?

1 Like

Hi @barry-scott,

Thank you so much for your suggestion. Yes, indeed this function is super useful: pprint().

The JSON file is now easier to understand.
Thank you so much for your help on this task!

I will see what I can do now with the JSON file…
Thanks a lot, Barry.

Lovely day,
Michael

Sorry about that @barry-scott,

I have answered you. Super useful function.
Thanks a lot for your suggestions!

Hi everyone,

I succeeded in having a proper Excel file, although it is not a .csv one. But for my purposes, this should be sufficient. I used Power Query in MS Excel to extract the data that I needed from the .json file.

I’d like to thank all the people who helped me with this post, who are @barry-scott & @kknechtel.
Thank you so much for your time, advices and feedback. I have learnt a lot from you. Thank you from the bottom of my heart.

python.org (and me the first!) are lucky to have such good programmers, who are really knowledgeable about Python.

We can thus close this thread!
All the best, and see you soon in a new post, I hope!

Michael