Find dupliates in one column, check the oldest date, return the value from another column in the row of the oldest date into row of newest date

Iamamanda · May 22, 2024, 2:25am

I am a newbie and have been unable to work out how to get the desired result.

I have multiple columns but for the purposes of this I will use only 3.

Number	Date	Code
1111	08/09/2004	aa1115
2583	07/05/2012	an5566
2183	09/12/2022	aa1115
1238	11/08/2014	yg8963
2186	01/01/2024	us1238
5869	05/02/2024	us1238

I want to complete the following:

check if the ‘code’ column has a duplicate
if duplicate I want to check the ‘date’ column for the oldest date and pull get the value from the ‘number’ column
place value of oldest date ‘number’ into new column ‘Required Output’ for the newest date of the duplicate with an ‘a’ at the end
if no duplicate use the value from ‘number’ column

Number	Date	Code	Required Output
1111	08/09/2004	aa1115	1111
2583	07/05/2012	an5566	2583
2183	09/12/2022	aa1115	1111a
1238	11/08/2014	yg8963	1238
2186	01/01/2024	us1238	2186
5869	05/02/2024	us1238	2186a

Hopefully that makes sense.

Code I am using is

() min_dates = df.groupy([code'])['date'].min() df['min_date'] = df.apply(lambda row: min_dates.loc[row['code"]],axis=a) ()

this gives me the below to get the min date for the duplicates but I am not sure how to get the oldest date ‘number’ into new column ‘Required Output’ for the newest date of the duplicate with an ‘a’ at the end

Number	Date	Code	min date
1111	08/09/2004	aa1115	08/09/2004
2583	07/05/2012	an5566	07/05/2012
2183	09/12/2022	aa1115	08/09/2004
1238	11/08/2014	yg8963	11/08/2014
2186	01/01/2024	us1238	01/01/2024
5869	05/02/2024	us1238	01/01/2024

DerSchinken · May 22, 2024, 7:00am

Where’s your code? Also, please add it in a code block.

What isn’t working with your code?

Iamamanda · May 22, 2024, 7:12pm

I have updated above but am using the below but not sure how to advance to get the ‘number’ of the row with the min date with an ‘a’ into the newer date row.

min_dates = df.groupy([code’])[‘date’].min()
df[‘min_date’] = df.apply(lambda row: min_dates.loc[row['code"]],axis=a)

onePythonUser · May 22, 2024, 11:40pm

Hi,

here is a script that satisfies the first two bullet points as stated above:

There is enough here to work from that you can work out the last two bullet points.

from pathlib import Path

FILE_PATH = Path('/Users/myComp/AppData/Roaming/JetBrains/PyCharmCE2024.1/scratches/temp_search_duplicate.txt')

def obtain_oldest_date_code(filename):
    with open(filename) as file:
        data = file.readlines()

        data_pairs = {}    # Create dictionary for date / code pairs
        codes = []         # Duplicate code list

        for line in data:  # iterate through each line

            columns = line.split()  # Separate each data column

            try:  # Only read non-string values in first column

                if type(int(columns[0])) is int:

                    data_pairs[columns[1]] = columns[2]
                    codes.append(columns[2])

            except ValueError:  # Ignore if a string value (i.e., column titles not part of analysis)
                pass

    duplicate_codes = list({x for x in codes if codes.count(x) > 1})  # Create list of duplicate values

    print('\nDuplicate code(s) and oldest date(s).')
    print('------------------------------------')
    print('\nCode             Date')
    print('----             ----')

    for code in duplicate_codes:

        dates_list = []

        for key, value in data_pairs.items():

            if code == value:

                dates_list.append(key)

        oldest_date = min(dates_list)
        print(code, oldest_date, sep='     ')


obtain_oldest_date_code(FILE_PATH)

Here are the test results:

script_test_results

Good luck!