Countplot of a categorical variable shows too many values. Are they specific values or ranges?

Tudor · May 24, 2021, 6:39pm

Hello! I want to apply some classification algorithms on a data set with a categorical variable of two possible values: 0 and 1. I built a countplot on this categorical variable and got this result:

What should I do in this situation?

The data set I am using is a public data set from Kaggle, and this is for a college project, so the data is not critical. I can remove the rows that have the wrong data.

In this plot, it’s clear that there are some values of 5, but I don’t understand the rest of the values, for example the “46”, “618” or “269375”.

Are they specific values that I have to process or are they ranges?

Blackward · May 26, 2021, 4:45pm

Hi Tudor,

Can you please share the corresponding code snippet?

Cheers, Dominik

Tudor · June 2, 2021, 10:48am

Hello! Here it is:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df_train = pd.read_csv(r'invoice_train.csv')
df_test = pd.read_csv(r'invoice_test.csv')
sns.countplot(x='counter_status', data=df_train)
plt.show()
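
On the "specific values or ranges" question: countplot draws one bar per distinct value in the column, so labels such as 46 or 269375 are individual values that occur in the data, not bins. A minimal sketch of how to check this with value_counts is below. The small DataFrame is a made-up stand-in for df_train, not the actual Kaggle data.

```python
import pandas as pd

# Made-up stand-in for df_train; the real column would come from invoice_train.csv.
df_train = pd.DataFrame(
    {"counter_status": [0, 1, 1, 0, 5, 5, 46, 618, 269375, 1]}
)

# One row per distinct value, with how often it occurs.
# These are the same categories that countplot draws as bars.
counts = df_train["counter_status"].value_counts()
print(counts)

# Values other than the two expected classes (0 and 1).
unexpected = sorted(int(v) for v in counts.index if v not in (0, 1))
print(unexpected)
```

It may also be worth printing df_train['counter_status'].dtype: if the column was read as text rather than numbers, the same value can show up under more than one label.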

And here is the data source:

Actually, I’ve already mapped all these values as ‘1’.

Thanks for the reply anyway!
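
For anyone landing here with the same problem, the two fixes mentioned in this thread (removing the rows with unexpected values, or mapping those values to 1) could look roughly like the sketch below. The data is made up for illustration, and whether mapping everything to 1 is appropriate depends on what the values mean in the original data set.

```python
import pandas as pd

# Made-up stand-in for df_train with a few unexpected values.
df_train = pd.DataFrame({"counter_status": [0, 1, 5, 46, 618, 269375, 0]})

# True for rows whose value is one of the two expected classes.
valid = df_train["counter_status"].isin([0, 1])

# Option 1: drop the rows with unexpected values.
df_dropped = df_train[valid]

# Option 2: keep every row, replacing unexpected values with 1.
# Series.where keeps entries where the condition is True and
# substitutes the second argument everywhere else.
df_mapped = df_train.assign(
    counter_status=df_train["counter_status"].where(valid, 1)
)

print(len(df_dropped))
print(df_mapped["counter_status"].tolist())
```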