How to merge several datasets, when variables differ but are similar?

michaeldg_94 · December 5, 2023, 1:59pm

Hi everyone,

I have a question to ask you, please:

I have recently asked about PDF scraping. So now, I will use this scraped data to merge them with another file. In particular, I am interested in the following:

I want to use the data that I have scraped for the years 2015, 2016, 2017 and merge these three datasets with another one. Basically, I want to do merge based on the vehicles model for each year.

Problem: The variable that includes car models and their prices is not written in the same way between these same databases, and moreover between the registration database.

I don’t know how to do this, and I’m stuck. Is there any way of matching the data sets, even if the models are not totally similar?

I would like to share some data. Is there a way to share datasets on python.org, please?

Thank you very much in advance for your help!
All the best,

hansgeunsmeyer · December 5, 2023, 4:07pm

As I understand your question, you have two or more .csv files (or DataFrame objects) with the same column names. Some of the values in column “model” and column “price” (or whatever those columns are called) are not exactly the same, but you want to treat them as the same when merging the tables.

If so, then you can either normalize the column values before you merge the tables – so you’d only be using one kind of representation of those values – or you could apply some fuzzy matching.
For the price column, I would normalize the column values, using some regex plus substitution.
So, you first rewrite all prices in order to make the format of the values the same. (Using any kind of fuzzy matching would very likely not get this right, since by default it doesn’t differentiate between format and content, – you don’t want to treat prices as the same if they only differ in one digit for instance.)

For the model column you could apply fuzzy matching. It turns out there are some Python libraries that do this (I never tried them so cannot tell more about it), for instance: GitHub - maxharlow/csvmatch: 🔎 Finds fuzzy matches between CSV spreadsheets

Alternatively, you can use difflib (part of Python stdlib), and then use difflib.close_matches:
difflib — Helpers for computing deltas — Python 3.12.0 documentation.
For a short tutorial on this, see: How to Perform Fuzzy Matching in Pandas (With Example) - Statology.

michaeldg_94 · December 5, 2023, 4:18pm

Hi,

First, thanks for your time and good explanations and links.
I am a novice in python and unfortunately does not know much about python commands.

What do you mean by “normalizing the column values” before the merge, please?
What is regex used for? Would it be possible to give me an example with a toy dataset so that I can better understand what it’s all about please? I apologize, I am used to using stata, but try now to delve deeper into python.

Thank you in advance for your help!

hansgeunsmeyer · December 5, 2023, 4:26pm

Normalizing values means make their format identical. E.g. if one value is “$1234” and another is “$1,234.00” and another just “1234”, you change all of these to “1234.00” so you can merge tables on those values.
A regex substitution rule can be use to do that normalization. If you ask ChatGPT this question, it will give you examples plus correct code for this (it cannot generally be trusted, but it will get this right and you can easily check).