Subtracting columns not working

quickhelps · January 6, 2022, 1:53pm

It seems simple but I can’t seem to find an efficient way to solve this in Python 3: Is there is a loop I can use in my dataframe that takes every column after the current column (starting with the 1st column), and subtracts it from the current column, so that I can add that resulting column to a new dataframe?

This is what my data looks like:

	on	3m	6m	9m	1y	18m	2y	3y	4y	5y	6y	7y	8y	9y	10y	15y	20y	25y
Dates																		
2010-03-09	0.5	0.52	0.58	0.90	1.19	2.01	2.70	3.66	4.37	4.87	5.32	5.54	5.77	5.80	5.89	6.04	6.19	NaN
2010-03-10	0.5	0.52	0.57	0.82	1.12	1.93	2.63	3.60	4.30	4.81	5.30	5.53	5.76	5.80	5.89	6.04	6.19	NaN
2010-03-11	0.5	0.52	0.57	0.80	1.10	1.93	2.63	3.58	4.30	4.81	5.30	5.53	5.76	5.80	5.89	6.04	6.19	NaN
2010-03-12	0.5	0.52	0.56	0.78	1.06	1.85	2.58	3.52	4.25	4.77	5.27	5.50	5.75	5.79	5.89	6.04	6.19	NaN
2010-03-15	0.5	0.52	0.56	0.76	1.05	1.84	2.56	3.52	4.25	4.75	5.25	5.48	5.73	5.76	5.86	6.01	6.16	NaN

This is what I want the resulting dataframe to look like:

Here is what I’ve done so far, but when running run_analysis my “result” equation brings up an error.

storage = [] #container that will store the results of the subtracted columns
def subtract (a,b): #function to call to do the column-wise subtractions
    return a-b

def run_analysis (frame, store):
    for first_col_index in range(len(frame)): #finding the first column to use
        temp=[] #temporary place to store the column-wise values from the analysis
        for sec_col_index in range(len(frame)): #finding the second column to subtract from the first column
            if (sec_col_index <= first_col_index): #if the column is below the current column or is equal to 
                                                   #the current column, then skip to next column
                continue
            else:
                result = [r for r in map(subtract, frame[sec_col_index], frame[first_col_index])]
            #if column above our current column, the subtract values in the column and keep the result in temp
                temp.append(result)
        store.append(temp) #save the complete analysis in the store

TobiasHT · January 6, 2022, 2:21pm

What’s the error?
Also, the solution would work much better if I knew exactly how the table was being represented in your code. When I wrote the solution, I used a list of lists, so modification needs to be made to fit your exact representation of the table.

The best solution however is to sacrifice some time and first learn Python from the basics to some level where you can write the solution on your own. Taking shortcuts kinda humpers one’s learning

quickhelps · January 6, 2022, 2:25pm

Hello and thank you Tobias,
This is the error

In [30]: run_analysis(spotrates, storage)
Out [30]: 
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 1

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-30-537108194349> in <module>
----> 1 run_analysis(spotrates, storage)

<ipython-input-10-c91f7b8bf19b> in run_analysis(frame, store)
     10                 continue
     11             else:
---> 12                 result = [r for r in map(subtract, frame[sec_col_index], frame[first_col_index])]
     13                 temp.append(result)
     14         store.append(temp)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3022             if self.columns.nlevels > 1:
   3023                 return self._getitem_multilevel(key)
-> 3024             indexer = self.columns.get_loc(key)
   3025             if is_integer(indexer):
   3026                 indexer = [indexer]

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
-> 3082                 raise KeyError(key) from err
   3083 
   3084         if tolerance is not None:

KeyError: 1

TobiasHT · January 6, 2022, 2:37pm

Yeah, the solution isn’t meant to work with pandas. It will need a modification.

quickhelps · January 6, 2022, 2:40pm

You’re right, modifications will need to be made, however the final solution I need to represent is based off of this loop so once I can get this, then I can use it as a base to modify and finish up the next dataframe I will have to work with.

If you don’t mind helping me figure out how to get past this, I can make the further changes on my own, but unfortunately I am on a deadline right now, so any quick help would be greatly appreciated.

quickhelps · January 6, 2022, 2:55pm

Could you point me in the right direction as to how I should approach this?

TobiasHT · January 6, 2022, 3:08pm

So the loop assumes that the data frame is a lists of lists. Perhaps if you could find a way of converting or creating a list of lists from your pandas data frame, then the solution would work perfectly. Or else, you’d have to modify the run_analysis function to understand the panda data frame. I can’t do that though since I’ve not used pandas before.

quickhelps · January 6, 2022, 3:11pm

Okay, I’ll have to modify it as I want to keep it within a Pandas Dataframe. Thank you so much!

aivarpaalberg · January 7, 2022, 8:21am

Do I understand correctly that you want to subtract values in first column from all other column values?

For starters I create sample dataframe:

+>>> import pandas as pd
+>>> import numpy as np
+>>> df = pd.DataFrame(np.arange(25).reshape(5,5), columns=list('abcde'))
+>>> df
    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24

Now I want new dataframe with all columns except first one. I just select all rows and columns starting from column ‘b’:

+>>> df.loc[:,'b':]
    b   c   d   e
0   1   2   3   4
1   6   7   8   9
2  11  12  13  14
3  16  17  18  19
4  21  22  23  24

Alternatively I can use indicies:

+>>> df.iloc[:, 1:]
    b   c   d   e
0   1   2   3   4
1   6   7   8   9
2  11  12  13  14
3  16  17  18  19
4  21  22  23  24

Now I just subtract values from first column:

+>>> df.loc[:, 'b':].subtract(df['a'], axis=0)
   b  c  d  e
0  1  2  3  4
1  1  2  3  4
2  1  2  3  4
3  1  2  3  4
4  1  2  3  4

Or alternatively using indices:

+>>> df.iloc[:, 1:].subtract(df.iloc[:, 0], axis=0)
   b  c  d  e
0  1  2  3  4
1  1  2  3  4
2  1  2  3  4
3  1  2  3  4
4  1  2  3  4

Of course, if it’s one value (as it appears to be in this specific case) it can be subtracted directly:

+>>> df_2 = df.loc[:, 'b':].subtract(df['a'], axis=0)
+>>> df_2
   b  c  d  e
0  1  2  3  4
1  1  2  3  4
2  1  2  3  4
3  1  2  3  4
4  1  2  3  4
+>>> df_2 - 1
   b  c  d  e
0  0  1  2  3
1  0  1  2  3
2  0  1  2  3
3  0  1  2  3
4  0  1  2  3

CAM-Gerlach · January 10, 2022, 11:57pm

@quickhelps I’d refer you to @aivarpaalberg 's excellent response for a direct answer to your question, but to note, you really should avoid for loops on Numpy arrays and Pandas DataFrames if at all possible, as especially for beginner tasks, there is always a more concise, much faster and more idiomatic way to do whatever you want using built-in numpy/pandas functions, methods and other operations.

It may be different form what you’re used to working with C or vanilla Python, but with scientific data you should seek to vectorize (i.e. call one method/function/operator on the whole dataframe/array) rather than iterate (e.g. take an action for every row, column, element, etc) since it both leads to cleaner, shorter code, and is much faster (in fact, I managed to speed up a friend’s code by around one million times, taking 100 s every run to taking 0.001 s, mostly just by vectorizing it). Working on small, simple datasets this isn’t so important, but once you’re doing more complicated things, it will pay for itself very quickly.