Counting Duplicates and Getting Sample Values

Hi, I’m a complete newbie to Python and have been getting by with Google, but I’m stuck. I’m trying to do some data profiling on a large CSV (1.9m records, 289 columns). I’ve already defined some columns for the output CSV below, and they give me the correct result.

What I want to do is add a few new columns to the output, such as a ‘Sample Values’ column which takes 3 unique values from each column, separated by commas. If there are fewer than 3 unique values for that column, it would just return all the unique values, or null if the entire column is null.
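
To make the ‘Sample Values’ requirement concrete, here is a minimal sketch on a made-up three-column frame. The helper name sample_values and the toy data are hypothetical, and it assumes "3 unique values" means the first three distinct non-null values in row order, which is only one possible reading.

```python
import pandas as pd

def sample_values(col, n=3):
    """Hypothetical helper: up to n distinct non-null values, comma-separated."""
    # dropna() removes nulls; unique() keeps values in order of first appearance
    uniques = col.dropna().unique()[:n]
    if len(uniques) == 0:
        # The entire column is null, so there is nothing to sample
        return None
    return ", ".join(str(v) for v in uniques)

# Made-up data: A has four uniques, B has two, C is entirely null
toy = pd.DataFrame({
    "A": ["x", "y", "x", "z", "w"],
    "B": ["p", "p", None, "p", "q"],
    "C": [None, None, None, None, None],
})

# apply() calls the helper once per column and returns one result per column
samples = toy.apply(sample_values)
print(samples)
```

If this reading is right, the same call on the real frame would be contacts_profile['Sample Values'] = df.apply(sample_values).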

The other column I want to create is ‘# of Duplicates’. This would return the total number of duplicates for each column. For example, say column A in my input CSV has the values A, A, A, B, B, C. There would be 5 total duplicates (the three A’s and the two B’s).
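
The counting rule in that example (every row whose value occurs more than once) can be sketched as below. The helper name count_duplicates is hypothetical, and dropping nulls before counting is an assumption, since the post does not say whether repeated nulls should count as duplicates.

```python
import pandas as pd

def count_duplicates(col):
    """Hypothetical helper: number of non-null rows whose value appears more than once."""
    # Assumption: nulls are ignored rather than treated as duplicates of each other
    non_null = col.dropna()
    # keep=False marks every member of a repeated group, including the first occurrence
    return int(non_null.duplicated(keep=False).sum())

# The example from the post: A, A, A, B, B, C
toy = pd.DataFrame({"A": ["A", "A", "A", "B", "B", "C"]})

# One count per column
dupes = toy.apply(count_duplicates)
print(dupes)
```

With the default keep='first', duplicated() would flag only the repeats after the first occurrence and give 3 for this column, so keep=False is what matches the count of 5 described above.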

Here is my preexisting code:

import pandas as pd
import numpy as np

df = pd.read_csv('C:/Users/CC/Downloads/All_Contacts/contacts_full.csv', low_memory = False)

#Define Columns
contacts_profile = pd.DataFrame()
contacts_profile['Data Type'] = df.dtypes
contacts_profile['Total Nulls'] = df.isnull().sum()
contacts_profile['% Nulls'] = contacts_profile['Total Nulls'] * 100 / len(df)
contacts_profile['Fill Rate'] = 100 - contacts_profile['% Nulls']
contacts_profile['Unique Values'] = df.nunique(axis=0)

#Export Results
path = "C:/Users/CC/Downloads/All_Contacts/"
contacts_profile.to_csv(path + 'contacts_full_profile.csv')

I suggest you state exactly what your problem is, but first make sure you have read enough about pandas. Then state where your program is going wrong or not giving what you want. It would also help to include a few lines of test data.