Returning sum of missing values for all columns

sarahw · October 13, 2022, 12:36pm

I'm working on this for a class and I'm stuck. Can you help me figure out how to return the missing-value counts for all columns?

In this challenge, you’ll be working with a DataFrame that contains data about artworks, and it contains many missing values.

Your task is to create a variable called na_sum that contains the total number of missing values in the DataFrame. When that’s completed, print out your answer!

Hint: The code given below will give you the number of missing (NaN) values for the Name column in the DataFrame. How would you edit the code to get the missing values for every column in the DataFrame?
Extra hint: You’ll be returning a single number which is the final sum() of everything.

df['Name'].isnull().sum()

abessman · October 13, 2022, 12:51pm

Hint 1:

You can iterate over every column name in the DataFrame like this:

for col in df:
    print(col)

Hint 2:

You can access each column, and its methods (including isnull), like this:

for col in df:
    df[col].isnull().sum()

See if you can’t figure it out from there. Feel free to ask for more hints.

sarahw · October 13, 2022, 9:53pm

I have this currently and it returns the counts for each column:

import pandas as pd
df = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/museum-collection-dataset/artworks.csv')

# Get the sum of all missing values in the DataFrame

na_sum = df[:].isnull().sum()

# Print the answer

print(na_sum)

—RESULT—

Test Results:
Log
Artwork ID 0
Title 52
Artist ID 1460
Name 1460
Date 2312
Medium 11919
Dimensions 11463
Acquisition Date 5463
Credit 3070
Catalogue 0
Department 0
Classification 0
Object Number 0
Diameter (cm) 128863
Circumference (cm) 130252
Height (cm) 18369
Length (cm) 129526
Width (cm) 19259
Depth (cm) 118819
Weight (kg) 129964
Duration (s) 127178
dtype: int64
Test
test_reverse
Unhandled Exception
Error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Traceback
File "/workspace/datascience/tests/test_solution.py", line 30, in test_reverse
self.assertEqual(839429, na_sum)
File "/usr/local/lib/python3.7/unittest/case.py", line 852, in assertEqual
assertion_func(first, second, msg=msg)
File "/usr/local/lib/python3.7/unittest/case.py", line 842, in _baseAssertEqual
if not first == second:
File "/workspace/datascience/lib/python3.7/site-packages/pandas/core/generic.py", line 1479, in __nonzero__
f"The truth value of a {type(self).__name__} is ambiguous. "

abessman · October 14, 2022, 5:55pm

Well, you’re very close. You’re just missing this final step from the instructions:

Extra hint: You’ll be returning a single number which is the final sum() of everything.

What you have there is the number of NaNs in each column. How could you get the sum of all of those numbers?
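
For context on the ValueError in the log: the test compares the expected number against na_sum, and when na_sum is a Series that comparison happens element by element. Here is a minimal sketch of the same failure, assuming a small made-up Series (the per_column name and its values are illustrative, not taken from the assignment):

```python
import pandas as pd

# Hypothetical stand-in for the per-column counts printed in the log.
per_column = pd.Series({"Title": 52, "Name": 1460, "Catalogue": 0})

# Comparing a number to a Series is element-wise, so the result is
# a Series of booleans rather than a single True/False.
comparison = (1512 == per_column)
print(comparison)

# unittest then runs `if not first == second:`, which needs one truth
# value from that boolean Series. pandas refuses to guess and raises.
try:
    bool(comparison)
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
```

So the test is not complaining about the counts themselves, only that it received many numbers where it expected one.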

Melendowski · October 14, 2022, 9:04pm

Hi Sarah,

Alexander is giving good direction.

However, I think the hint from the assignment is a bit misleading and doesn't teach idiomatic pandas. Most DataFrame methods, including isnull and sum, work on every column at once, and reductions like sum run per column by default (axis=0), because pandas organizes its data by column. Let me elaborate below.

For example, there is an isnull method on the DataFrame itself (see pandas.DataFrame.isnull in the pandas 1.5.0 documentation).

It returns a DataFrame where each element is True or False: True if the element is missing (NaN or None), False otherwise. To get the count for each column, look at pandas.DataFrame.sum, which by default sums each column (see pandas.DataFrame.sum in the pandas 1.5.0 documentation).

Note that True values are counted as 1 and False values as 0.

This yields the number of null values in each column. The result is a pandas.Series, which also has a sum method that adds up the values in the Series.

You are very close in your answer. You have the number of null values per column but not the total for the entire DataFrame. You’re missing one step.

You could have your answer in 1 line…

df.isnull().sum().sum()
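
To make the three steps described above concrete, here is a small sketch on made-up data. The toy DataFrame and the names mask, per_column, and total are assumptions for illustration only; the real artworks data is much larger.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the artworks DataFrame.
toy = pd.DataFrame({
    "Title": ["Untitled", None, "Study"],
    "Height (cm)": [10.0, np.nan, np.nan],
})

mask = toy.isnull()        # DataFrame of True/False, same shape as toy
per_column = mask.sum()    # Series: one count per column (True counts as 1)
total = per_column.sum()   # single number: all missing values in toy

print(per_column)
print(total)
```

The first sum() collapses each column to a count, and the second collapses those counts to one number, which is the shape the test expects.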

sarahw · October 17, 2022, 2:52pm

You guys are absolutely brilliant! Thank you both for your help. I have so much to learn!

Here was the correct code!

import pandas as pd
df = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/museum-collection-dataset/artworks.csv')

# Get the sum of all missing values in the DataFrame

na_sum = df[:].isnull().sum().sum()

# Print the answer

print(na_sum)

vovavili · October 18, 2022, 6:04am

Use .loc and .isna() to find missing values in a single column. Use a for loop (for i in df.columns) with df.loc[:, i] notation to go through the DataFrame column by column, use .sum() to count the True values in each column after the boolean conversion, and use += to add each count to a running counter. Good luck with your assignment.
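
A rough sketch of that loop, assuming a small made-up DataFrame in place of the real data (toy and na_count are illustrative names, not part of the assignment):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the artworks DataFrame.
toy = pd.DataFrame({
    "Title": ["Untitled", None, "Study"],
    "Height (cm)": [10.0, np.nan, np.nan],
})

na_count = 0
for col in toy.columns:
    # .isna() marks missing entries as True; .sum() counts the Trues.
    na_count += toy.loc[:, col].isna().sum()

print(na_count)
```

This visits one column at a time and keeps a running total, so the result is a single number rather than a Series.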

Pro tip: you can skip all of the above and do it like this. Can you figure out on your own why it works?

df.isna().sum().sum()