Create one master data file with specific data points extracted from +5000 excel workbooks

Objective: Create one master data file with specific data cells extracted from thousands of excel workbooks which was created over the last +10 years.

Current Situation: I have a folder with subfolders and thousands of files in pdf, word, ppt and excel.

Issues: I would need to extract from only the relevant excel workbooks data into one excel spreadsheet as master data file.

Anchor points: All relevant excel workbooks have following in common:
a) They have the same data of creation. ( This could be used as the search criterium for selecting the relevant files)
b) The relevant data to be extracted are referenced to named cells ie. the position of the relevant cells can be in different rows or columns but the name is the same and defined

Output: The master data spreadsheet would show per Excel workbook (approx. 5000 line) one row with up to 500 data points. Each column would reference to the “named cell”

Any ideas how to slove this?

Hi !

Your project looks totally feasible using Python. I would suggest the following approach:

  • You first create the final spreadsheet, as a pandas dataframe
  • You iterate over the files contained in your base folder, filtering by their extension (.xls or .xlsx I assume). That can be done using pathlib.glob
  • If needed, you can further filter the yielded Excel files by their creation date (pathlib.Path.stat can help you for that)
  • For each Excel file, you load it as a pandas dataframe, extract the information of interest, and copy it to your final spreadsheet
  • Once you have looped through all the source files and all the information is copied to the final spreadsheet, you save it as an Excel file (pandas.DataFrame.to_excel)

For the practical implementation, there is quite much documentation for the pathlib and pandas modules. You can also find many helpful tutorials/guides/code examples/threads on internet related to your problem, you’re certainly not the first person wanting to extract information from multiple Excel files! (first result on google: python - Data Extraction from multiple excel files in pandas dataframe - Stack Overflow)
Hope this helps!

2 Likes