Counting Duplicates and Getting Sample Values

gooseontheloose · July 1, 2022, 8:29am

Hi, I’m a complete newbie to Python and have been getting by with Google, but I’m stuck. I’m trying to do some data profiling on a large CSV (1.9m records, 289 columns). I’ve already defined some columns for the output CSV below, and they give me the correct result.

What I want to do is add a few new columns to the output, such as a ‘Sample Values’ column that takes 3 unique values from each input column, separated by commas. If there are fewer than 3 unique values for that column, it would just return all the uniques, or null if the entire column is null.
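One possible reading of the ‘Sample Values’ requirement, as a minimal sketch rather than a definitive answer. The helper name `sample_values` and the toy data are made up for illustration; the sketch assumes "3 unique values" means the first three distinct non-null values in order of appearance.

```python
import pandas as pd


def sample_values(column: pd.Series, n: int = 3):
    """Return up to n distinct non-null values joined by commas.

    Returns None when the column holds no non-null values at all.
    (Hypothetical helper, not part of pandas.)
    """
    # dropna() removes nulls; unique() keeps order of first appearance
    uniques = column.dropna().unique()[:n]
    if len(uniques) == 0:
        return None
    return ", ".join(str(value) for value in uniques)


# Made-up test data standing in for the real CSV
toy = pd.DataFrame({
    "city": ["Leeds", "York", "Leeds", "Bath", "Hull"],
    "flag": ["Y", "N", "Y", "Y", None],
    "empty": [None, None, None, None, None],
})

# One entry per input column, keyed by column name
samples = {name: sample_values(toy[name]) for name in toy.columns}
print(samples)
```

If this matches what is wanted, wrapping the dictionary in `pd.Series(samples)` should give something that can be assigned as a new column of the profile, since it is indexed by the input column names like the other profile columns.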

The other column I want to create is ‘# of Duplicates’. This would return the total number of duplicates for each column. For example, say column A in my input CSV has values A, A, A, B, B, C. There would be 5 total duplicates.
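The count of 5 in that example implies that every occurrence of a repeated value is counted (three A’s plus two B’s), not just the extra copies. A small sketch of the difference, using `Series.duplicated` on the example values; the variable names are made up for illustration.

```python
import pandas as pd

# The example values from the post
values = pd.Series(["A", "A", "A", "B", "B", "C"])

# keep=False flags every occurrence of a value that appears more than once
in_repeated_group = int(values.duplicated(keep=False).sum())

# The default (keep='first') flags only the second and later occurrences
extra_copies = int(values.duplicated().sum())

print(in_repeated_group)  # 5: A, A, A, B, B
print(extra_copies)       # 3: two extra A's and one extra B
```

For a whole DataFrame, something like `df.apply(lambda col: col.duplicated(keep=False).sum())` would presumably produce one count per column. Note that `duplicated` treats repeated nulls as duplicates of each other, so dropping nulls first may be appropriate if they should not be counted.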

Here is my preexisting code:

import pandas as pd
import numpy as np

df = pd.read_csv('C:/Users/CC/Downloads/All_Contacts/contacts_full.csv', low_memory=False)

# Define columns (one row of the profile per column of the input)
contacts_profile = pd.DataFrame()
contacts_profile['Data Type'] = df.dtypes
contacts_profile['Total Nulls'] = df.isnull().sum()
contacts_profile['% Nulls'] = contacts_profile['Total Nulls'] * 100 / len(df)
contacts_profile['Fill Rate'] = 100 - contacts_profile['% Nulls']
contacts_profile['Unique Values'] = df.nunique(axis=0)

# Export results
path = "C:/Users/CC/Downloads/All_Contacts/"
contacts_profile.to_csv(path + 'contacts_full_profile.csv')

mikecm · July 3, 2022, 10:15am

I suggest you state exactly what your problem is. Make sure you have read enough about pandas first, then say where your program is going wrong or not giving you what you want. It would also help to post a few lines of test data.
