Dataframes and Dictionary 'Lookups'

I have a problem coordinating ‘looked up’ data in a dictionary with new columns I need to create in a base dataframe that is constructed first. Here is a more detailed layout of the symptoms and my goals.

Base or original dataframe compiled from earlier Python Processes(REGULAR Symbol Data)

       DateSym     Symbol  ...  'FIELD1'       'FIELD2'

0 20230110DDD DDD 0.395 -.5
1 20230110EEE EEE .16 6
2 20230110FFF FFF 1.22 22.1
3 20230111DDD DDD .335 -13.2
4 20230111EEE EEE .3341 -0.3967
5 20230111FFF FFF 3.5 4.2

Dictionary containing values I want to ‘lookup’ based on a concatenated key made up of a symbol e.g. ‘AAA’ and a date in ‘yyyymmdd’ format.(SPECIAL Symbol Data)

{'20230110AAA ': [‘AAA’, ‘22.01’, ‘322.11’], ’ 20230110BBB ': [‘BBB’, ‘230.19’, ‘33.25’], ’ 20230110CCC ': [‘CCC’, ‘18.19’, ‘13.25’]
'20230111AAA ': [‘AAA’, ‘.056’, ‘32.65’], ’ 20230111BBB ': [‘BBB’, ‘23.18’, ‘3.16’], ’ 20230111CCC ': [‘CCC’, ‘15.15’, 9.01] }

My system design requires that I string ‘SPECIAL’ data from the lookup all out on ONE ROW—REGULAR Symbol—data SPECIAL Symbol AAA—data then SPECIAL Symbol BBB—data etc.

  1. I take the ‘Date part’ from column ‘DateSym’ in REGULAR Symbol Data, create a loop attaching each of a LIST of SPECIAL Symbols[AAA,BBB,CCC] and concatenate a key, which looks like
    ‘20230110AAA’ and ‘lookup’ the values for that key in the dictionary and want to create NEW COLUMNS in the original dataframe for each of the multiple associated values in the dictionary
    For that key.

HERE IS WHAT I WANT:
DateSym Symbol … ‘FIELD1’ ‘FIELD2’ ‘FIELD3’ ‘FIELD4’ ‘FIELD5’ ‘FIELD6’ ‘FIELD7’ ‘FIELD8’ ‘FIELD9’ ‘FIELD10’ ‘FIELD11’
0 20230110DDD DDD 0.395 -.5 AAA 22.01 322.11 BBB 230.19 33.25 CCC 18.19 13.25
1 20230110EEE EEE .16 6 AAA 22.01 322.11 BBB 230.19 33.25 CCC 18.19 13.25
2 20230110FFF FFF 1.22 22.1 AAA 22.01 322.11 BBB 230.19 33.25 CCC 18.19 13.25
3 20230111DDD DDD .335 -13.2 AAA .056 32.65 BBB 23.18 3.16 CCC 15.15 9.01
4 20230111EEE EEE .3341 -0.3967 AAA .056 32.65 BBB 23.18 3.16 CCC 15.15 9.01
5 20230111FFF FFF 3.5 4.2 AAA .056 32.65 BBB 23.18 3.16 CCC 15.15 9.01

THIS IS WHAT I GET(note only the LAST LOOKED UP values on each and every record—LIST of special symbols has 3 elements 0-2, I only get values associated with final iteration or third element(2)):
DateSym Symbol … ‘FIELD1’ ‘FIELD2’ ‘FIELD3’ ‘FIELD4’ ‘FIELD5’ ‘FIELD6’ ‘FIELD7’ ‘FIELD8’ ‘FIELD9’ ‘FIELD10’ ‘FIELD11’
0 20230110DDD DDD 0.395 -.5 AAA .056 32.65 BBB 23.18 3.16 CCC 15.15 9.01
1 20230110EEE EEE .16 6 AAA .056 32.65 BBB 23.18 3.16 CCC 15.15 9.01
2 20230110FFF FFF 1.22 22.1 AAA .056 32.65 BBB 23.18 3.16 CCC 15.15 9.01
3 20230111DDD DDD .335 -13.2 AAA .056 32.65 BBB 23.18 3.16 CCC 15.15 9.01
4 20230111EEE EEE .3341 -0.3967 AAA .056 32.65 BBB 23.18 3.16 CCC 15.15 9.01
5 20230111FFF FFF 3.5 4.2 AAA .056 32.65 BBB 23.18 3.16 CCC 15.15 9.01

This does NOT seem like a complex thing to do in any programming language. I just can’t seem to get it to work. I’ve tried various ideas from online suggestions. Can suggest a
Piece of code that can handle this? Thank you in advance.

Please read the sticky and format the data examples like code, so that we can properly see how you intend for things to line up.

Keep in mind that dictionaries cannot contain duplicate keys. Trying to specify them like this will result in the older values being overwritten:

>>> {'a': 1, 'b': 2, 'c': 3, 'a': 4, 'b': 5, 'c': 6}
{'a': 4, 'b': 5, 'c': 6}

Thank you for your response, but note, there are no duplicate keys in the dictionary. The concatenated key is the key and they are unique. It’s basically a Date+Special Symbol and they are all unique. Each keyed dictionary entry has multiple values I need to extract and place on a single row with each Regular Symbol I loop through from the primary and previously created dataframe.

As a separate matter, can you direct me to the ‘sticky’ please? And thank you again…

I misread the data.

The sticky (pinned) thread should appear at the top of the category for you, unless you unpinned it. I seem to have done so, and lost track, but I think this is it:

As for why you get the wrong result, we can only possibly comment on code that is actually shown to us.

Thanks. Ok I will code out this subsection and get back to this post. It will take me a little while.

Ok…here is the code with some commentary to guide it.

import pandas as pd
SpecialData_Dictionary = {'20230110AAA': ['AAA', '22.01', '322.11'],  \
                          '20230110BBB': ['BBB', '230.19', '33.25'], \
                          '20230110CCC': ['CCC', '18.19', '13.25'], \
                          '20230111AAA': ['AAA', '.056', '32.65'], \
                          '20230111BBB': ['BBB', '23.18', '3.16'], \
                          '20230111CCC': ['CCC', '15.15', '9.01'], \
                          '20230112AAA': ['AAA', '34.56', '66.21'], \
                          '20230112BBB': ['BBB', '00.01', '13.25'], \
                          '20230112CCC': ['CCC', '48.47', '16.11']}
### initialize sample data as list of lists
sampledata = [['20230110ONE', 'ONE','16.23','.037' ,'anytext1'], ['20230110TWO', 'TWO','19.01','1.66' ,'anytext2'], \
             ['20230110333', '333','15.01','.099' ,'anytext3'], ['20230111ONE', 'ONE','72.25','23.22' ,'anytext4'], \
             ['20230111TWO', 'TWO','44.17','1.62' ,'anytext5'], ['20230111333', '333','13.12','86.05', 'anytext6'], \
             ['20230112ONE', 'ONE','53','65.79' ,'anytext7'], ['20230112TWO', 'TWO','32.99','2.67', 'anytext8'], \
             ['20230112333', '333','43.43','49.55','anytext9']]
  
# Create the pandas DataFrame
regulardata = pd.DataFrame(sampledata, columns=['SymDate', 'Symbol','Price1','Price2','Text1'])
  
# print dataframe BEFORE Image.
print("BEFORE!!!: ", regulardata)
#Begin process to loop and link SPECIAL DATA in dictionary as NEW FIELDS in existing dataframe of REGULAR DATA.
#Every single day there is SPECIAL DATA for 3 and only 3 symbols and I know what those symbols are AAA,BBB,CCC.
#Create a list of those 3 symbols
SpecialSymbolList=["AAA","BBB","CCC"]
Outerindex=0
#Loop through each entry in the dataframe
while Outerindex < len(regulardata):
    Symdatepart=(regulardata['SymDate'][Outerindex])[:8]
    Innerindex=0
    while Innerindex < len(SpecialSymbolList):
        #Create a lookup key for the dictionary by taking the date part from each regular and the 'current' value returned
        #from the SpecialSymbolList, a concatenated key that should have an entry in the dictionary
        symdatepartPLUS=Symdatepart + SpecialSymbolList[Innerindex]
        if symdatepartPLUS in SpecialData_Dictionary:
            regulardata["'" + SpecialSymbolList[Innerindex] + "NewField1"]=SpecialData_Dictionary[symdatepartPLUS][0]
            regulardata["'" + SpecialSymbolList[Innerindex] + "NewField2"]=SpecialData_Dictionary[symdatepartPLUS][1]
            regulardata["'" + SpecialSymbolList[Innerindex] + "NewField3"]=SpecialData_Dictionary[symdatepartPLUS][2]
            Innerindex=Innerindex+1
        else:
            print("Not in Dictionary")
            regulardata["'" + SpecialSymbolList[Innerindex] + "NewField1"]="No Data"
            regulardata["'" + SpecialSymbolList[Innerindex] + "NewField2"]="No Data"
            regulardata["'" + SpecialSymbolList[Innerindex] + "NewField3"]="No Data"
            Innerindex=Innerindex+1
            print(symdatepartPLUS)
        Outerindex=Outerindex+1
# print dataframe AFTER Image. I should have NINE new fields, 3 each per SpecialData_Dictionary keys and corresponding and
# DISTINCT values per each key. Not happening......instead I get a repeat of LAST logical looked up key:values.
print("AFTER!!!: ",regulardata)

Hi,

Any suggestions on this code and how to fix the problem? Thank you in advance.

Here, you are setting the column regulardata["'" + SpecialSymbolList[Innerindex] + "NewField1"] with a scalar value, SpecialData_Dictionary[symdatepartPLUS][0]. This sets the whole column to the same value.

You do this three times for each of the columns ''AAANewField1', ''AAANewField2', ''AAANewField3', ''BBBNewField1', ''BBBNewField2', ''BBBNewField3', ''CCCNewField1', ''CCCNewField2', ''CCCNewField3', overwriting them completely each time.

To solve this, you could create each column before the loop with uninitialized data, then set the actual value via both row and column index. For example:

newcols = ["'" + iidx + f"NewField{i+1}" for iidx in SpecialSymbolList for i in range(3)]
regulardata = regulardata.assign(**{k: None for k in newcols})
while OuterIndex < len(regulardata):
    ...
    while InnerIndex < len(SpecialSymbolList):
        ...
        if symdatepartPLUS in SpecialData_Dictionary:
            regulardata["'" + SpecialSymbolList[Innerindex] + "NewField1"][Outerindex] = SpecialData_Dictionary[symdatepartPLUS][0]
            ...

Sample output:

       SymDate Symbol Price1  ... 'CCCNewField1 'CCCNewField2 'CCCNewField3
0  20230110ONE    ONE  16.23  ...          None          None          None
1  20230110TWO    TWO  19.01  ...          None          None          None
2  20230110333    333  15.01  ...           CCC         18.19         13.25
3  20230111ONE    ONE  72.25  ...          None          None          None
4  20230111TWO    TWO  44.17  ...          None          None          None
5  20230111333    333  13.12  ...           CCC         15.15          9.01
6  20230112ONE    ONE     53  ...          None          None          None
7  20230112TWO    TWO  32.99  ...          None          None          None
8  20230112333    333  43.43  ...           CCC         48.47         16.11

Is that closer to what you want?

YES! Much closer thank you. However, 2 things. First, I trust that the ‘sample data’ is simply incomplete because there are valid and unique data points, or should be, for each and every column ‘looked up’ by key in that dictionary and there shouldn’t be any ‘nones’. Secondly, I am a ‘decent’ programmer and somewhat new to Python and some of the syntax remains cryptic to me. As in:

iidx + f"NewField{i+1}

Is the iidx variable–which I assume to be a counter or index to a loop—initialized anywhere or merely stating it, initializes it? Also the ‘f’ before NewField? What on earth is that?

Those things said, I ‘think’ you’ve added just two lines of code above my original. I will await any commentary and then incorporate those two lines an experiment to see if I can get this to function the way I need it to. THANK YOU A MILLION for the help!

["'" + iidx + f"NewField{i+1}" for iidx in SpecialSymbolList for i in range(3)]

is a list comprehension:

The iidx is initialized by the for iidx in SpecialSymbolList for i in range(3), just like in the similar for-loop syntax.

f"NewField{i+1}" is an f-string (AFAIK, not an official name, but what everyone ended up calling it):

The value of i + 1 is interpolated into the string, similarly to "NewField{}".format(f + 1).

Wow. Great explanation. I will study and apply your suggested fix reporting back if any other issues. Just two lines of code above my original? Awesome. Thanks again.

Another bit of syntactic sugar that might look unfamiliar is regulardata = regulardata.assign(**{k: None for k in newcols}). This is a dictionary comprehension (similar to a list comprehension, but creates a dict instead of a list), combined with dictionary unpacking. The ** notation means “unpack the dictionary into keyword=value pairs”. In this case, regulardata = regulardata.assign(**{k: None for k in newcols}) is equivalent to

regulardata = regulardata.assign(
    AAANewField1=None,
    AAANewField2=None,
    AAANewField3=None,
    BBBNewField1=None,
    BBBNewField2=None,
    BBBNewField3=None,
    CCCNewField1=None,
    CCCNewField2=None,
    CCCNewField3=None
)

I’m trying to understand what your end goal is here. Would something like this work?

import pandas as pd

SpecialData_Dictionary = {
    "20230110AAA": ["AAA", "22.01", "322.11"],
    "20230110BBB": ["BBB", "230.19", "33.25"],
    "20230110CCC": ["CCC", "18.19", "13.25"],
    "20230111AAA": ["AAA", ".056", "32.65"],
    "20230111BBB": ["BBB", "23.18", "3.16"],
    "20230111CCC": ["CCC", "15.15", "9.01"],
    "20230112AAA": ["AAA", "34.56", "66.21"],
    "20230112BBB": ["BBB", "00.01", "13.25"],
    "20230112CCC": ["CCC", "48.47", "16.11"],
}

sampledata = [
    ["20230110ONE", "ONE", "16.23", ".037", "anytext1"],
    ["20230110TWO", "TWO", "19.01", "1.66", "anytext2"],
    ["20230110333", "333", "15.01", ".099", "anytext3"],
    ["20230111ONE", "ONE", "72.25", "23.22", "anytext4"],
    ["20230111TWO", "TWO", "44.17", "1.62", "anytext5"],
    ["20230111333", "333", "13.12", "86.05", "anytext6"],
    ["20230112ONE", "ONE", "53", "65.79", "anytext7"],
    ["20230112TWO", "TWO", "32.99", "2.67", "anytext8"],
    ["20230112333", "333", "43.43", "49.55", "anytext9"],
]

column_names = ["SymDate", "Symbol", "Price1", "Price2", "Text1"]
regulardata = pd.DataFrame(
    sampledata,
    columns=column_names,
)
regulardata.set_index("SymDate", inplace=True)
specialdata = pd.DataFrame.from_dict(
    SpecialData_Dictionary, orient="index", columns=column_names[1:-1]
)
specialdata.index.name = column_names[0]
combineddata = pd.concat([regulardata, specialdata])
print(combineddata)

Output:

            Symbol  Price1  Price2     Text1
SymDate                                     
20230110ONE    ONE   16.23    .037  anytext1
20230110TWO    TWO   19.01    1.66  anytext2
20230110333    333   15.01    .099  anytext3
20230111ONE    ONE   72.25   23.22  anytext4
20230111TWO    TWO   44.17    1.62  anytext5
20230111333    333   13.12   86.05  anytext6
20230112ONE    ONE      53   65.79  anytext7
20230112TWO    TWO   32.99    2.67  anytext8
20230112333    333   43.43   49.55  anytext9
20230110AAA    AAA   22.01  322.11       NaN
20230110BBB    BBB  230.19   33.25       NaN
20230110CCC    CCC   18.19   13.25       NaN
20230111AAA    AAA    .056   32.65       NaN
20230111BBB    BBB   23.18    3.16       NaN
20230111CCC    CCC   15.15    9.01       NaN
20230112AAA    AAA   34.56   66.21       NaN
20230112BBB    BBB   00.01   13.25       NaN
20230112CCC    CCC   48.47   16.11       NaN

Or do you specifically want to add the special symbols as extra columns? If so, how should the SymDates be handled? Should a single row contain SymDates for both the ONE and CCC symbols, for example?

Thank you. I am seeking to add all of the columns from each of the dictionary lookups(3 each per input dataframe row) as new columns in the original dataframe in the SAME row as each input row from that dataframe. The Symdate NEED NOT be repeated in those columns. In the original I might have a detail symbol of ‘ONE’, as in the first row of sampledata. That is the symbol column and IF it were the ONLY column in the sampledata dataframe, and IF the date being analyzed was 20230110 the output would look like ‘ONE’ ‘AAA’ ‘22.01’ ‘322.11’ ‘BBB’ ‘230.19’ ‘33.25’ ‘CCC’ ‘18.19’ ‘13.25’.

That said…I DO want to include ALL of the original ‘sampledata’ columns in the revised output, not JUST that symbol row in my example paragraph above…

…meant to say ‘NOT just that symbol COLUMN in my example paragraph above…’

So, I’m still not certain how to translate your good advice and techniques into the data I need on each of those rows within my inner loop. Can you please provide guidance?

First off, as a general piece of advice when working with pandas DataFrames: They are designed to be created and manipulated row-by-row, column-by-column, or even everything-at-once, NOT element-by-element. Whenever you find yourself reaching for an iterative solution when working with DataFrames, you should stop and think:

  • Do I really need to use a loop here?
  • If yes, is a DataFrame really the correct data model in this case?

So, rather than add your special symbols one at a time in a loop, could something like this work for you?

import pandas as pd

column_names = ["SymDate", "Symbol", "Price1", "Price2", "Text1"]
regular_data = pd.DataFrame(
    [
        ["20230110", "ONE", "16.23", ".037", "anytext1"],
        ["20230110", "TWO", "19.01", "1.66", "anytext2"],
        ["20230110", "333", "15.01", ".099", "anytext3"],
        ["20230111", "ONE", "72.25", "23.22", "anytext4"],
        ["20230111", "TWO", "44.17", "1.62", "anytext5"],
        ["20230111", "333", "13.12", "86.05", "anytext6"],
        ["20230112", "ONE", "53", "65.79", "anytext7"],
        ["20230112", "TWO", "32.99", "2.67", "anytext8"],
        ["20230112", "333", "43.43", "49.55", "anytext9"],
    ],
    columns=column_names,
)
special_data = pd.DataFrame(
    [
        ["20230110", "AAA", "22.01", "322.11"],
        ["20230110", "BBB", "230.19", "33.25"],
        ["20230110", "CCC", "18.19", "13.25"],
        ["20230111", "AAA", ".056", "32.65"],
        ["20230111", "BBB", "23.18", "3.16"],
        ["20230111", "CCC", "15.15", "9.01"],
        ["20230112", "AAA", "34.56", "66.21"],
        ["20230112", "BBB", "00.01", "13.25"],
        ["20230112", "CCC", "48.47", "16.11"],
    ],
    columns=column_names[:-1],
)
combined_data = pd.concat([regular_data, special_data])
print(combined_data)

Output:

    SymDate Symbol  Price1  Price2     Text1
0  20230110    ONE   16.23    .037  anytext1
1  20230110    TWO   19.01    1.66  anytext2
2  20230110    333   15.01    .099  anytext3
3  20230111    ONE   72.25   23.22  anytext4
4  20230111    TWO   44.17    1.62  anytext5
5  20230111    333   13.12   86.05  anytext6
6  20230112    ONE      53   65.79  anytext7
7  20230112    TWO   32.99    2.67  anytext8
8  20230112    333   43.43   49.55  anytext9
0  20230110    AAA   22.01  322.11       NaN
1  20230110    BBB  230.19   33.25       NaN
2  20230110    CCC   18.19   13.25       NaN
3  20230111    AAA    .056   32.65       NaN
4  20230111    BBB   23.18    3.16       NaN
5  20230111    CCC   15.15    9.01       NaN
6  20230112    AAA   34.56   66.21       NaN
7  20230112    BBB   00.01   13.25       NaN
8  20230112    CCC   48.47   16.11       NaN

You can now index by e.g. SymDate

>>> combined_data.loc[combined_data["SymDate"] == "20230111"]

    SymDate Symbol Price1 Price2     Text1
3  20230111    ONE  72.25  23.22  anytext4
4  20230111    TWO  44.17   1.62  anytext5
5  20230111    333  13.12  86.05  anytext6
3  20230111    AAA   .056  32.65       NaN
4  20230111    BBB  23.18   3.16       NaN
5  20230111    CCC  15.15   9.01       NaN

or by Symbol

>>> combined_data.loc[combined_data["Symbol"] == "AAA"]
 
    SymDate Symbol Price1  Price2 Text1
0  20230110    AAA  22.01  322.11   NaN
3  20230111    AAA   .056   32.65   NaN
6  20230112    AAA  34.56   66.21   NaN

The output still doesn’t look right or as intended. The 'NaN’s in the Text1 column don’t make sense to me since there is valid data in the original dataframe and across all columns in the 3 sets of data I am trying to match in the dictionary via lookup. Tricky business these dataframes! :smile: A word about the overall system design: I have two identical streams of logic which take data and use aggregation and ranking etc. which make the dataframe a good model for each stream’s final ‘output’ if you will. The difference between the two streams is that the special symbols have 3 rows for each row in the regular data. The goal was to create one ‘massive row’ of data with the regular data to the left, followed by special row 1 data, special row 2 data and special row 3 data in succession on that single row. Rows are accumulated into a presumed XL sheet upon which formulas can be written to filter and research strategies regarding the relationship between regular data and one or more or all of the special symbol data. MUCH easier to do this(at least for me) if the data is strung out in one row like this. Do you suppose it would be easier to convert the regular data into a .csv file, then read it back in and attempt my ‘dictionary lookup loop’ from that model and output to another .csv???

Is there valid data though? In the example data you posted

SpecialData_Dictionary = {
    "20230110AAA": ["AAA", "22.01", "322.11"],
    "20230110BBB": ["BBB", "230.19", "33.25"],
    "20230110CCC": ["CCC", "18.19", "13.25"],
    "20230111AAA": ["AAA", ".056", "32.65"],
    "20230111BBB": ["BBB", "23.18", "3.16"],
    "20230111CCC": ["CCC", "15.15", "9.01"],
    "20230112AAA": ["AAA", "34.56", "66.21"],
    "20230112BBB": ["BBB", "00.01", "13.25"],
    "20230112CCC": ["CCC", "48.47", "16.11"],
}

there is no data for the Text1 column.

Again, I’m not seeing that in your example data. Perhaps I am misinterpreting it. To me, it looks like a row contains one entry each of ['SymDate', 'Symbol','Price1','Price2','Text1'], and this is true for the regular data and the special data both (except the latter is missing the Text1 field).

I guess I don’t understand what’s special about the “special data”. Could you clarify?

‘Regular’ Data is similarly formatted stock symbol data(IBM, AMZN, CAT etc.). The mix of those symbols can vary greatly from day to day or period to period as I choose to trade or not trade them. ‘Special’ Data is also a stock symbol, but specifically for the 3 major market indices I study, one for S and P 500 or ‘SPY’, one for Nasdaq or ‘QQQ’, one for smaller stocks or ‘IWM’----those NEVER change. I look at those every single day and the performance of those relates to ANY regular stock I might look at on a given day. AND I want to repeat the data points on the ‘Special’ symbols along the rows of the ‘Regular’ symbols so that I can easily assess PER REGULAR symbol how it might have performed RELATIVE to the peformance of the ‘SPECIAL’ symbol(s) on it’s row.