Find dupliates in one column, check the oldest date, return the value from another column in the row of the oldest date into row of newest date

I am a newbie and have been unable to work out how to get the desired result.

I have multiple columns but for the purposes of this I will use only 3.

Number Date Code
1111 08/09/2004 aa1115
2583 07/05/2012 an5566
2183 09/12/2022 aa1115
1238 11/08/2014 yg8963
2186 01/01/2024 us1238
5869 05/02/2024 us1238

I want to complete the following:

  • check if the ‘code’ column has a duplicate
  • if duplicate I want to check the ‘date’ column for the oldest date and pull get the value from the ‘number’ column
  • place value of oldest date ‘number’ into new column ‘Required Output’ for the newest date of the duplicate with an ‘a’ at the end
  • if no duplicate use the value from ‘number’ column
Number Date Code Required Output
1111 08/09/2004 aa1115 1111
2583 07/05/2012 an5566 2583
2183 09/12/2022 aa1115 1111a
1238 11/08/2014 yg8963 1238
2186 01/01/2024 us1238 2186
5869 05/02/2024 us1238 2186a

Hopefully that makes sense.

Code I am using is

() min_dates = df.groupy([code'])['date'].min() df['min_date'] = df.apply(lambda row: min_dates.loc[row['code"]],axis=a) ()

this gives me the below to get the min date for the duplicates but I am not sure how to get the oldest date ‘number’ into new column ‘Required Output’ for the newest date of the duplicate with an ‘a’ at the end

Number Date Code min date
1111 08/09/2004 aa1115 08/09/2004
2583 07/05/2012 an5566 07/05/2012
2183 09/12/2022 aa1115 08/09/2004
1238 11/08/2014 yg8963 11/08/2014
2186 01/01/2024 us1238 01/01/2024
5869 05/02/2024 us1238 01/01/2024

Where’s your code? Also, please add it in a code block.

What isn’t working with your code?

I have updated above but am using the below but not sure how to advance to get the ‘number’ of the row with the min date with an ‘a’ into the newer date row.

min_dates = df.groupy([code’])[‘date’].min()
df[‘min_date’] = df.apply(lambda row: min_dates.loc[row['code"]],axis=a)

Hi,

here is a script that satisfies the first two bullet points as stated above:

There is enough here to work from that you can work out the last two bullet points.

from pathlib import Path

FILE_PATH = Path('/Users/myComp/AppData/Roaming/JetBrains/PyCharmCE2024.1/scratches/temp_search_duplicate.txt')

def obtain_oldest_date_code(filename):
    with open(filename) as file:
        data = file.readlines()

        data_pairs = {}    # Create dictionary for date / code pairs
        codes = []         # Duplicate code list

        for line in data:  # iterate through each line

            columns = line.split()  # Separate each data column

            try:  # Only read non-string values in first column

                if type(int(columns[0])) is int:

                    data_pairs[columns[1]] = columns[2]
                    codes.append(columns[2])

            except ValueError:  # Ignore if a string value (i.e., column titles not part of analysis)
                pass

    duplicate_codes = list({x for x in codes if codes.count(x) > 1})  # Create list of duplicate values

    print('\nDuplicate code(s) and oldest date(s).')
    print('------------------------------------')
    print('\nCode             Date')
    print('----             ----')

    for code in duplicate_codes:

        dates_list = []

        for key, value in data_pairs.items():

            if code == value:

                dates_list.append(key)

        oldest_date = min(dates_list)
        print(code, oldest_date, sep='     ')


obtain_oldest_date_code(FILE_PATH)


Here are the test results:

script_test_results

Good luck!