Traverse through multiple folder/subfolders and extract Json data and store into CSV using python

kmutyala · October 21, 2022, 12:36pm

I have multiple JSON files in multiple folders and subfolders C:\Users\kma\Desktop\Bots****\content.json

the stars (*) above represent multiple folders and each subfolder has one json file named as content.json and I want to extract some data from it. I need help to traverse recursively through these files and append all the data to one csv file.thanks

facelessuser · October 21, 2022, 2:34pm

I would check out glob. It allows you to create recursive search patterns. For instance, you could search with a pattern: c:/Users/kma/Desktop/Bots/**/content.json.

Python also has a json module which is great for reading data from JSON.

Lastly, Python has a CSV module allowing you to write to a CSV file.

There are also plenty of 3rd party libraries that can enhance your experience with these tasks as well, but the aforementioned modules should be sufficient to get the job done.

kmutyala · October 21, 2022, 3:15pm

@facelessuser I have used all those three modules and able to do the recursion and its working fine

For each row I want append one more record which is the base path of the folder which that file belongs to. Any suggestions on how to add it. I know how to extract it from the file path but appending it along with the output row , Im not able to figure out

import glob
import csv
import json
import os

files = glob.glob('C:\\Users\\kmutyala\\Desktop\\Bots\\**\\content.json', recursive = True)
count =1 
#print(fnames)
for f in files[:3:]:
 with open(f) as json_file:
  data = json.load(json_file)
 employee_data = data['packages']
 data_file = open('data_file.csv', 'a')
 csv_writer = csv.writer(data_file)

 print(count)
 count += 1
 #print(employee_data)
 #print(f)
 for emp in employee_data:
  print(emp)   
  csv_writer.writerow(emp.values())

 data_file.close()

facelessuser · October 21, 2022, 3:29pm

I assume part of the issue is that you are creating a new csv document on every new file? I would consider moving the creation of csv_writer to occur outside of the loop so you just have one CSV document created for all files.

kmutyala · October 21, 2022, 3:33pm

this is resolved but I need help on this :
For each row I want append one more record which is the base path of the folder which that file belongs to. Any suggestions on how to add it. I know how to extract it from the file path but appending it along with the output row , Im not able to figure out

facelessuser · October 21, 2022, 3:42pm

I assume emp.values() is a list? I assume something like this could work:

csv_writer.writerow(emp.values() + [filename])

I don’t fully understand the output you are looking for, but you just need to ensure you are passing a list that contains all the information you want.

EDIT: I originally had it prepending, not appending.

kmutyala · October 21, 2022, 3:47pm

csv_writer.writerow(emp.values()+[os.path.dirname(f)])
TypeError: unsupported operand type(s) for +: ‘dict_values’ and ‘list’

emp.values has key value pair values , its a dictionary

{'name': 'Application', 'version': '2.1.0-20210118-184957'}
{'name': 'Comment', 'version': '2.4.0-20210118-185007'}
{'name': 'ErrorHandler', 'version': '2.4.0-20210118-185029'}

facelessuser · October 21, 2022, 3:53pm

Sorry:

csv_writer.writerow(list(emp.values()) + [filename])

kmutyala · October 21, 2022, 3:55pm

Awesome that works.Thanks for the help

kmutyala · October 21, 2022, 4:03pm

@facelessuser Actually I have huge volumes of data and CSV is not taking more that 5146 rows and Im getting this error:

Traceback (most recent call last):
  File "C:/Users/kmutyala/Desktop/pkg3.py", line 13, in <module>
    data = json.load(json_file)
  File "C:\Users\kmutyala\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "C:\Users\kmutyala\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 85173: character maps to <undefined>

Any limitations on the nu,ber of rows for CSV Writer

facelessuser · October 21, 2022, 4:07pm

This is not a row limit, this is a Unicode decode issue.

kmutyala · October 21, 2022, 4:29pm

@facelessuser Okay, Do you suggest any workaround on this thanks.

Not able to find whats causing this issue

facelessuser · October 21, 2022, 5:17pm

The actual error is happening in the JSON parsing. Something is wrong with the file or options parsing it. I’m not sure.

vbrozik · October 21, 2022, 7:31pm

The byte 0x9d cannot be decoded by the default encoding being used. This is platform dependent and it seems to be cp1252 on yours.^[1] The solution is to find out which encoding was used to create the JSON file and specify the encoding in the open() call:

It is pretty possible that it is UTF-8.

…really weird world of Windows at the time UTF-8 is almost everywhere… ↩︎

kmutyala · October 23, 2022, 7:19am

@vbrozik Thank you , Looks like UTF 8 resolved the issue