I'm working on this for a class and I am stuck… Can you help me figure out how to return the values for all columns?
In this challenge, you’ll be working with a DataFrame that contains data about artworks, and it contains many missing values.
Your task is to create a variable called na_sum that contains the total number of missing values in the DataFrame. When that’s completed, print out your answer!
Hint: The code given below will give you the number of missing (NaN) values for the Name column in the DataFrame. How would you edit the code to get the missing values for every column in the DataFrame? Extra hint: You’ll be returning a single number which is the final sum() of everything.
# Get the sum of all missing values in the DataFrame
na_sum = df[:].isnull().sum()
# Print the answer
print(na_sum)
—RESULT—
Test Results:
Log
Artwork ID 0
Title 52
Artist ID 1460
Name 1460
Date 2312
Medium 11919
Dimensions 11463
Acquisition Date 5463
Credit 3070
Catalogue 0
Department 0
Classification 0
Object Number 0
Diameter (cm) 128863
Circumference (cm) 130252
Height (cm) 18369
Length (cm) 129526
Width (cm) 19259
Depth (cm) 118819
Weight (kg) 129964
Duration (s) 127178
dtype: int64
Test
test_reverse
Unhandled Exception
Error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Traceback
File "/workspace/datascience/tests/test_solution.py", line 30, in test_reverse
self.assertEqual(839429, na_sum)
File "/usr/local/lib/python3.7/unittest/case.py", line 852, in assertEqual
assertion_func(first, second, msg=msg)
File "/usr/local/lib/python3.7/unittest/case.py", line 842, in _baseAssertEqual
if not first == second:
File "/workspace/datascience/lib/python3.7/site-packages/pandas/core/generic.py", line 1479, in __nonzero__
f"The truth value of a {type(self).__name__} is ambiguous. "
I think, however, that the hint from the assignment is a bit misleading and isn't teaching idiomatic pandas. Reductions such as sum() on a pandas DataFrame are performed per column by default (axis=0), which fits the column-oriented way pandas stores data. Let me elaborate below.
Calling df.isnull() on the whole DataFrame returns a DataFrame in which each element is True or False: True if the element is missing (for example numpy.nan), False otherwise. To get the total for each column, I would look at pandas.DataFrame.sum, which performs a summation over each column (see "pandas.DataFrame.sum" in the pandas 1.5.0 documentation).
Note that True values are counted as 1 and False values as 0.
Summing that boolean DataFrame yields the number of null values in each column. The resulting object is a pandas.Series, which also has a sum method that sums the values of the Series.
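To make those intermediate objects concrete, here is a minimal sketch. The small DataFrame is a made-up stand-in for the artworks data (its column names and values are assumptions), so only the shape of each result matters:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the artworks DataFrame
df = pd.DataFrame({
    "Title": ["A", None, "C"],
    "Date": [1990.0, np.nan, np.nan],
    "Department": ["Prints", "Prints", "Film"],
})

mask = df.isnull()         # DataFrame of booleans, same shape as df
per_column = mask.sum()    # Series: one missing-value count per column
na_sum = per_column.sum()  # single number: total across all columns

print(type(mask).__name__)        # DataFrame
print(type(per_column).__name__)  # Series
print(na_sum)                     # 3
```

Each call to sum() collapses one dimension: the first turns the DataFrame into a Series, the second turns the Series into a scalar.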
You are very close in your answer. You have the number of null values per column but not the total for the entire DataFrame. You’re missing one step.
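This also explains why the test raises a ValueError instead of simply failing. A small sketch, using a made-up two-column DataFrame (an assumption, not the assignment's data), of what happens when a number is compared with a Series:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 missing values in total
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})
per_column = df.isnull().sum()  # a Series, like na_sum in the question

# Comparing a number with a Series is element-wise and gives a Series
comparison = (3 == per_column)

raised = False
try:
    # unittest's assertEqual evaluates "if not first == second:"
    if not comparison:
        pass
except ValueError as err:
    raised = True
    print(err)

total = per_column.sum()  # a single number
print(3 == total)         # True
```

Because `not` needs one True or False and a Series holds several, pandas refuses to guess. Once the value is a single number, the comparison works as the test expects.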
Use .loc and .isna() to find missing values in a single column. Use a for loop (for i in df.columns) with df.loc[:, i] notation to go through the DataFrame column by column. Use .sum() on each boolean column to count its True values, and use += to increase your counter by that count. Good luck with your assignment.
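The steps above could look roughly like the sketch below. The DataFrame is a hypothetical stand-in (its columns and values are assumptions), and the variable name na_sum is borrowed from the assignment:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the artworks DataFrame
df = pd.DataFrame({
    "Title": ["A", None, "C"],
    "Date": [1990.0, np.nan, np.nan],
    "Department": ["Prints", "Prints", "Film"],
})

na_sum = 0  # running counter of missing values
for i in df.columns:
    column = df.loc[:, i]          # one column as a Series
    missing = column.isna().sum()  # True counts as 1, False as 0
    na_sum += missing

print(na_sum)  # 3
```

The explicit loop is slower and wordier than the vectorized version, but it makes each step visible, which can help while learning.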
Pro tip: you can skip all of the above and do it like this. Can you figure out why it works on your own?
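One single expression that fits this description is sketched below. This is an assumption and may not be the one the poster had in mind; the DataFrame is again a made-up stand-in:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the artworks DataFrame
df = pd.DataFrame({
    "Title": ["A", None, "C"],
    "Date": [1990.0, np.nan, np.nan],
    "Department": ["Prints", "Prints", "Film"],
})

# Convert the boolean mask to a 2-D NumPy array and sum every element at once
na_sum = int(df.isna().to_numpy().sum())

print(na_sum)  # 3
```

It works because a NumPy array's sum() with no axis argument adds up all elements, and each True in the mask counts as 1.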