Sum of columns in pandas

Hello

I’m new to Python, some 6 weeks in, and working on my project.
I’m having trouble creating a new column that would sum two existing ones, which I’ve created from a pivot table.
The table looks like this:

                                            size          
Timely_Response                               No       Yes
Company                                                   
WELLS FARGO & COMPANY                     3244.0   67675.0
EQUIFAX, INC.                             1627.0  114076.0
BANK OF AMERICA, NATIONAL ASSOCIATION     1585.0   80519.0
OCWEN LOAN SERVICING LLC                   542.0   27208.0
Colony Brands, Inc.                        419.0     256.0

What I need to do is add a “Total” column that sums the “No” and “Yes” values for each row,
then add a “‘No’ % of Total” column, and finally sort by that percentage column.

I tried using sum, but that gave me numbers that were far too large:

slow_resolution_sorted["Totals"]=slow_resolution_sorted.sum(axis=1)
                                             size                 Totals
Timely_Response                                No        Yes            
Company                                                                 
WELLS FARGO & COMPANY                      3244.0    67675.0    567352.0
EQUIFAX, INC.                              1627.0   114076.0    925624.0

Then I tried adding No and Yes directly, but I got an error:

slow_resolution_sorted["Totals"]=slow_resolution_sorted["No"]+slow_resolution_sorted["Yes"]

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
-> 3082                 raise KeyError(key) from err
   3083 
   3084         if tolerance is not None:

KeyError: 'No'

Any ideas what I’m doing wrong and how to fix it?
Thanks!

Add the total column:

table["Total"] = table["No"] + table["Yes"]

Add the percentage column:

table["'No' % of Total"] = table["No"] / table["Total"] * 100

Sort by the percentage:

table.sort_values(["'No' % of Total"], ascending=[True], inplace=True)
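Put together on a small made-up frame (company names and numbers invented, assuming flat “No”/“Yes” columns), the three steps look like this:

```python
import pandas as pd

# Toy frame with flat "No"/"Yes" columns (invented numbers)
table = pd.DataFrame(
    {"No": [3244.0, 1627.0, 419.0], "Yes": [67675.0, 114076.0, 256.0]},
    index=["WELLS FARGO & COMPANY", "EQUIFAX, INC.", "Colony Brands, Inc."],
)

table["Total"] = table["No"] + table["Yes"]
table["'No' % of Total"] = table["No"] / table["Total"] * 100

# Smallest "No" percentage first
table.sort_values("'No' % of Total", ascending=True, inplace=True)
print(table)
```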

Hi @MRAB
Thanks for the suggestion, but it doesn’t work for me.
I’ve tried

slow_resolution_sorted["Totals"]=slow_resolution_sorted["No"]+slow_resolution_sorted["Yes"]

but I’m getting this error

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'No'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-54-c8770f865dc6> in <module>
      2 #TO DO
      3 #slow_resolution_sorted["Totals"]=slow_resolution_sorted.sum(axis=1)
----> 4 slow_resolution_sorted["Totals"]=slow_resolution_sorted["No"]+slow_resolution_sorted["Yes"]
      5 
      6 print(slow_resolution_sorted)

~\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3021         if is_single_key:
   3022             if self.columns.nlevels > 1:
-> 3023                 return self._getitem_multilevel(key)
   3024             indexer = self.columns.get_loc(key)
   3025             if is_integer(indexer):

~\anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_multilevel(self, key)
   3072     def _getitem_multilevel(self, key):
   3073         # self.columns is a MultiIndex
-> 3074         loc = self.columns.get_loc(key)
   3075         if isinstance(loc, (slice, np.ndarray)):
   3076             new_columns = self.columns[loc]

~\anaconda3\lib\site-packages\pandas\core\indexes\multi.py in get_loc(self, key, method)
   2874 
   2875         if not isinstance(key, tuple):
-> 2876             loc = self._get_level_indexer(key, level=0)
   2877             return _maybe_to_slice(loc)
   2878 

~\anaconda3\lib\site-packages\pandas\core\indexes\multi.py in _get_level_indexer(self, key, level, indexer)
   3156         else:
   3157 
-> 3158             idx = self._get_loc_single_level_index(level_index, key)
   3159 
   3160             if level > 0 or self.lexsort_depth == 0:

~\anaconda3\lib\site-packages\pandas\core\indexes\multi.py in _get_loc_single_level_index(self, level_index, key)
   2807             return -1
   2808         else:
-> 2809             return level_index.get_loc(key)
   2810 
   2811     def get_loc(self, key, method=None):

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
-> 3082                 raise KeyError(key) from err
   3083 
   3084         if tolerance is not None:

KeyError: 'No'

I am wondering if this has to do with the table shape?
The columns “No” and “Yes” seem to actually be “(size, No)” and “(size, Yes)”:

slow_resolution_sorted.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5275 entries, WELLS FARGO & COMPANY to xCoins, Inc.
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   (size, No)   5275 non-null   float64
 1   (size, Yes)  5275 non-null   float64
dtypes: float64(2)
memory usage: 123.6+ KB

Any suggestions please?
Thank you!

What are the names/keys of those columns ('size', 'No') and ('size', 'Yes')?

Hi @MRAB
I created this table from a DataFrame with pivot_table, so if it were Excel I think the table would look like the attached screenshot:

[image]

It would be helpful to see the representation of the DataFrame, and the code you used to create it. The following is based on my best guess as to how your data is structured.

It seems like you have multiple column levels, like this:

>>> df
                         size      
                          Yes    No
WELLS FARGO & COMPANY   67676.0  3244.0
Equifax                114076.0  1627.0

You can address each column like this:

>>> df[("size", "Yes")]
WELLS FARGO & COMPANY     67676.0
Equifax                  114076.0
Name: (size, Yes), dtype: float64

or like this:

>>> df["size"]["Yes"]
WELLS FARGO & COMPANY     67676.0
Equifax                  114076.0
Name: Yes, dtype: float64

To take the sum of the columns:

>>> df["Total"] = df["size"]["Yes"] + df["size"]["No"]
>>> df
                           size             Total
                            Yes      No          
WELLS FARGO & COMPANY   67676.0  3244.0   70920.0
Equifax                114076.0  1627.0  115703.0

or

>>> df["Total"] = df["size"].sum(axis=1)
>>> df
                           size             Total
                            Yes      No          
WELLS FARGO & COMPANY   67676.0  3244.0   70920.0
Equifax                114076.0  1627.0  115703.0
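For completeness, here is the whole pipeline end-to-end on a toy two-level frame (numbers invented). Note that assigning with a plain string key like `"Total"` on a frame with two column levels creates the column `("Total", "")`, so sorting needs the full tuple:

```python
import pandas as pd

# Toy frame with a two-level column index, like the pivot table produces
df = pd.DataFrame(
    {("size", "No"): [3244.0, 1627.0], ("size", "Yes"): [67675.0, 114076.0]},
    index=["WELLS FARGO & COMPANY", "EQUIFAX, INC."],
)

df["Total"] = df["size"].sum(axis=1)  # sums only the "size" sub-columns
df["'No' % of Total"] = df[("size", "No")] / df["Total"] * 100

# The new columns live at ("Total", "") and ("'No' % of Total", ""),
# so sort_values needs the tuple key
df = df.sort_values(("'No' % of Total", ""), ascending=False)
print(df)
```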

Hi @abessman

I think you’re onto something re the pivot table quality; there are layers of columns. I tried the line below, but it gave me the wrong totals.

df["Total"] = df["size"]["Yes"] + df["size"]["No"] 
                                            size              Totals
Timely_Response                               No       Yes          
Company                                                             
WELLS FARGO & COMPANY                     3244.0   67675.0  135350.0
EQUIFAX, INC.                             1627.0  114076.0  228152.0

I think my pivot table is faulty. This is how I created it; see below for the df head and info.
There is a column Timely_Response with values Yes/No. I wanted to count how many times Yes or No occurred for each company, and used the code below. Where there were no Yes’s or No’s I got NaN, which I filled with 0. Note my aggfunc is “size”; for some reason, when I tried to use “count” it created 4 different tables.
Not sure what’s going on here, to be honest.

slow_resolution=pd.pivot_table(compl_df, index=["Company"], columns=["Timely_Response"], aggfunc=["size"])
slow_resolution_sorted = slow_resolution.reindex(slow_resolution["size"].sort_values(by="No", ascending=False).index)
slow_resolution_sorted=slow_resolution_sorted.fillna(0)
print(slow_resolution_sorted)
#dataframe

	Date received	Product	Sub-product	Issue	Sub-issue	Consumer complaint narrative	Company public response	Company	State	ZIP code	Consumer consent provided?	Submitted via	Date sent to company	Company response to consumer	Timely response?	Consumer disputed?	Complaint ID
0	2019-05-10	Checking or savings account	Checking account	Managing an account	Problem using a debit or ATM card	No Data	No Data	NAVY FEDERAL CREDIT UNION	FL	328XX	N/A	Web	2019-05-10	In progress	Yes	N/A	3238275
1	2019-05-10	Checking or savings account	Other banking product or service	Managing an account	Deposits and withdrawals	No Data	No Data	BOEING EMPLOYEES CREDIT UNION	WA	98204	N/A	Referral	2019-05-10	Closed with explanation	Yes	N/A	3238228
2	2019-05-10	Debt collection	Payday loan debt	Communication tactics	Frequent or repeated calls	No Data	No Data	CURO Intermediate Holdings	TX	751XX	N/A	Web	2019-05-10	Closed with explanation	Yes	N/A	3237964
#df.info()

compl_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1282355 entries, 0 to 1282354
Data columns (total 19 columns):
 #   Column            Non-Null Count    Dtype         
---  ------            --------------    -----         
 0   Date_Received     1282355 non-null  datetime64[ns]
 1   Product           1282355 non-null  object        
 2   Sub-product       1282355 non-null  object        
 3   Issue             1282355 non-null  object        
 4   Sub-issue         1282355 non-null  object        
 5   Complaint_Detail  1282355 non-null  object        
 6   Public_Response   1282355 non-null  object        
 7   Company           1282355 non-null  object        
 8   State             1282355 non-null  object        
 9   ZIP_Code          1282355 non-null  object        
 10  Consumer_Consent  1282355 non-null  object        
 11  Medium            1282355 non-null  object        
 12  Date_Sent         1282355 non-null  datetime64[ns]
 13  Response_Detail   1282355 non-null  object        
 14  Timely_Response   1282355 non-null  object        
 15  Disputed          1282355 non-null  object        
 16  Complaint ID      1282355 non-null  int64         
 17  Year              1282355 non-null  int64         
 18  Month             1282355 non-null  int64         
dtypes: datetime64[ns](2), int64(3), object(14)
memory usage: 185.9+ MB
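In case it helps, the extra column level comes from passing aggfunc as a list: `aggfunc=["size"]` prepends a “size” level, giving the `(size, No)` / `(size, Yes)` columns seen in the info() output. Passing it as a plain string keeps the columns flat, so `pivot["No"]` works directly. A sketch on invented data (not the real compl_df):

```python
import pandas as pd

# Made-up stand-in for compl_df, restricted to the two relevant columns
compl_df = pd.DataFrame(
    {
        "Company": ["A Corp", "A Corp", "A Corp", "B Inc", "B Inc"],
        "Timely_Response": ["Yes", "Yes", "No", "Yes", "No"],
    }
)

# aggfunc="size" as a plain string: counts rows, no extra "size" level
pivot = pd.pivot_table(
    compl_df, index="Company", columns="Timely_Response", aggfunc="size"
).fillna(0)

pivot["Totals"] = pivot["No"] + pivot["Yes"]
print(pivot)
```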

Is this slow_resolution_sorted or compl_df? If it is the former, it looks like it contains a lot more than just the desired Timely_Response column.

I think we need to go further back. How was compl_df created?

Hi @abessman
you’re correct! There was something wrong with the tables. compl_df is a copy of a CSV file,
and slow_resolution_sorted was a subset of that.

I just used a different method, crosstab, and it worked like a charm.
Thanks for your help!

slow_resolution=compl_df[["Company","Timely_Response"]].copy()
slow_resolution2=pd.crosstab(slow_resolution["Company"], slow_resolution["Timely_Response"].fillna("0"))
slow_resolution2["Total"] = slow_resolution2["Yes"] + slow_resolution2["No"]
slow_resolution2["'No' % of Total"] = slow_resolution2["No"] / slow_resolution2["Total"] * 100
slow_resolution2.sort_values(["'No' % of Total","No"], ascending=[False, False], inplace=True)
slow_resolution2.head(50)
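For anyone landing here later, the same crosstab approach runs end-to-end on toy data like this (company names and counts invented); crosstab produces flat “No”/“Yes” columns, which is why the plain string keys work here:

```python
import pandas as pd

# Invented stand-in for the complaints data
compl_df = pd.DataFrame(
    {
        "Company": ["A Corp"] * 3 + ["B Inc"] * 4,
        "Timely_Response": ["Yes", "No", "No", "Yes", "Yes", "Yes", "No"],
    }
)

ct = pd.crosstab(compl_df["Company"], compl_df["Timely_Response"])
ct["Total"] = ct["Yes"] + ct["No"]
ct["'No' % of Total"] = ct["No"] / ct["Total"] * 100

# Worst "No" percentage first, ties broken by raw "No" count
ct.sort_values(["'No' % of Total", "No"], ascending=[False, False], inplace=True)
print(ct)
```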

