Identify duplicate numbers and see if next column have series of numbers

i200yrs · June 14, 2023, 10:15am

Hello Python Experts…Please give a code to start. In the dataframe have 2 columns x and y.
Each columns have a series of numbers…I want to check x-column if have 3 duplicate numbers and check all numbers inline to y-column if have in series. And vice-versa…Example as below:
Where x-column have number 4qty of 29 duplicated and equivalanet number in y-column have series 82,83,84…Really appreciate any help…thanks a lot.

moosemaimer · June 14, 2023, 3:15pm

Not a pandas wizard, but I put together the following in regular Python:

data = [
    (11,82),
    (12,80),
    (18,69),
    ...
]

for n in range(len(data) - 2):
    if data[n][0] == data[n+1][0] == data[n+2][0]:
        if data[n][1] == data[n + 1][1] - 1 == data[n + 2][1] - 2:
            print(f'Match found at rows {n}-{n + 2}')

i200yrs · June 15, 2023, 3:46am

That one works if the duplicate number is in series…but if not it cannot detect…below is the sample.

x	y
58	27
68	28
38	42
72	49
47	50
19	54
29	62
74	65
75	65
18	70
29	82 <----
30	82
29	83 <----
30	83
29	84 <----
50	94
73	95
57	111
45	111
35	111
37	121
44	127

Hope somone can help…meanwhile, i continue study how to get it done…thanks

rob42 · June 15, 2023, 7:05am

Your data set has four 29s and you said:

Yet, you indicate the 29s:

29	82 <----
30	82
29	83 <----
30	83
29	84 <----

This is not a complete working solution, but it may help you out:

data = {
    'x': [58, 68, 38, 72, 47, 19, 29, 74, 75, 18, 29, 30, 29, 30, 29, 50, 73],
    'y': [27, 28, 42, 49, 50, 54, 62, 65, 65, 70, 82, 82, 83, 83, 84, 94, 95]
}

index_dup_x = []


for index, x in enumerate(data['x']):
    for dup_index, dup_x in enumerate(data['x']):
        if index == dup_index:
            pass
        else:
            if dup_x == x:
                index_dup_x.append((index, x))

index_dup_x = list(set(index_dup_x))

for index, y in enumerate(data['y']):
    for item in index_dup_x:
        if index == item[0]:
            print(f"x: {item[1]}\ty:{y}")

[output]

x: 29	y:62
x: 29	y:82
x: 30	y:82
x: 29	y:83
x: 30	y:83
x: 29	y:84

As you can see, it’s also listed the 30s, so you’ll need to modify the code so that you only get the required number of duplicated numbers for each column.

i200yrs · June 15, 2023, 9:46am

Sorry Rob…i have tried to modify but still no success…I will continue trying again and again…thanks

i200yrs · June 15, 2023, 9:47am

thank you…how to print the data value instead of index? thanks
And pls, ask a favor, pls explain why there’s -2 in below code…thanks

for n in range(len(data) - 2):

rob42 · June 15, 2023, 1:16pm

Maybe, if you post the code that’s not doing what you want it to do and explain what you’d like it to do (with examples of what the input looks like and and what the output should look like) I can offer some more advice or even write the code for you, but without a better understanding, I’ll be guessing, which makes any coded that a write, useless.

Just to add:

Another way to do this would be to make a dictionary of the indices, for the numbers that you want to locate. You would than know how many there are and where they are. This is a POC for column ‘X’:

data = {
    'x': [58, 68, 38, 72, 47, 19, 29, 74, 75, 18, 29, 30, 29, 30, 29, 50, 73],
    'y': [27, 28, 42, 49, 50, 54, 62, 65, 65, 70, 82, 82, 83, 83, 84, 94, 95]
}

duplicates = {}

for index, x in enumerate(data['x']):
    for dup_index, dup_x in enumerate(data['x']):
        if index == dup_index:
            pass
        else:
            if dup_x == x:
                if x not in duplicates:
                    duplicates[x] = [index]
                else:
                    if index not in duplicates[x]:
                        duplicates[x].append(index)

for dup in duplicates:
    print(
        f"{len(duplicates[dup])} x {dup} found in colunm 'X'. The indices are: {duplicates[dup]}")

moosemaimer · June 15, 2023, 1:53pm

The check was looking 2 values ahead of the loop index, so the loop would stop before running out of items and throwing an IndexError. If the numbers aren’t sequential, you might want to sort the data into ascending order before doing anything else with it.

i200yrs · June 16, 2023, 1:44am

Hello Rob…good day…Basically i have dataframe including X and Y columns. I want to detect at least 3 duplicates number under X and see its Y values are in sequence (and vice-versa). If duplicate values from X & Y does not pair with sequence then ignore it. Below is the sample of input and output. Thanks in advance.

example duplicate in X

Input	
X	Y
23	11
33	23  <---
65	45
78	43
34	78
33	24  <---
44	45
54	87
23	34
33	22  <---
58	54
34	95
33	23  <---
45	84
	
	
Output	
X	Y
33	22
33	23
33	24


example duplicate in Y

Input	
Y	X
11	23
32	45  <---
45	65
43	78
78	34
78	45  <---
45	44
87	54
34	23
76	45  <---
54	58
95	34
77	45  <---
84	45
	
	
Output	
X	Y
76	45
77	45
78	45

MRAB · June 16, 2023, 2:18am

OK, how about this:

from collections import defaultdict
import pandas as pd

df = pd.DataFrame(
    [
        (11, 23),
        (32, 45),
        (45, 65),
        (43, 78),
        (78, 34),
        (78, 45),
        (45, 44),
        (87, 54),
        (34, 23),
        (76, 45),
        (54, 58),
        (95, 34),
        (77, 45),
        (84, 45),
    ], columns=('Y', 'X')
)

entries = defaultdict(list)

for x, y in zip(df.loc[ : , 'X'], df.loc[ : , 'Y']):
    entries[x].append(y)

print('X\tY')

for x, y_list in entries.items():
    y_list.sort()

    for start in range(len(y_list) - 2):
        if y_list[start] == y_list[start + 1] - 1 == y_list[start + 2] - 2:
            for y in y_list[start : start + 3]:
                print(f'{x}\t{y}')

rob42 · June 16, 2023, 5:37am

No worries.

I’m still not too sure about what you want out of the second example, but maybe the solution is already there in the post from @MRAB.

In any case, this is a modified version of the script from my last post:

data = {
    'x': [23, 33, 65, 78, 34, 33, 44, 54, 23, 33, 58, 34, 33, 45],
    'y': [11, 23, 45, 43, 78, 24, 45, 87, 34, 22, 54, 95, 23, 84]
}

duplicates = {}

for index, x in enumerate(data['x']):
    for dup_index, dup_x in enumerate(data['x']):
        if index == dup_index:
            pass
        else:
            if dup_x == x:
                if x not in duplicates:
                    duplicates[x] = [data['y'][index]]
                else:
                    duplicates[x].append(data['y'][index])

discard = []

for key in duplicates:
    if len(duplicates[key]) < 4:
        discard.append(key)
for key in discard:
    duplicates.pop(key)

for x in duplicates:
    y_values = duplicates[x]
    y_values = list(set(y_values))
    y_values.sort()
    duplicates[x] = y_values

print("x\ty")
for x in duplicates:
    for y in duplicates[x]:
        print(f"{x}\t{y}")

[output]

i200yrs · June 16, 2023, 5:39am

Thanks Matt…it works…by the way, how about if there’s no detection.
The results like below:

X     Y

How i can make a code to detect if X and Y is empty…I tried len and info and others method but got error. Thanks a lot to you.

aivarpaalberg · June 16, 2023, 5:52am

For first part using pandas tools:

get unique values in X
get subframes for every unique value
check whether there is at least three rows

import pandas as pd

data = {
    'X': [58, 68, 38, 72, 47, 19, 29, 74, 75, 18, 29, 30, 29, 30, 29, 50, 73],
    'Y': [27, 28, 42, 49, 50, 54, 62, 65, 65, 70, 82, 82, 83, 83, 84, 94, 95]
}

df = pd.DataFrame(data)
uniq = pd.unique(df['X'])

for value in uniq:
     mask = df['X'].values == value   # or simply df['Y'] == value
     subframe = df[mask]
     if 3 <= subframe.index.size:
         print(subframe)

# will output
    X   Y
6   29  62
10  29  82
12  29  83
14  29  84

For me there is some ambiguity how should Y column be checked. Should entire column be increasing? If so pd.Series.is_monotonic_increasing could be used. If not, what are the exact requirements/conditions for increase?

MRAB · June 16, 2023, 4:45pm

Inside of printing the results in the loop, just store them in a list and then print them out afterwards.

Topic		Replies	Views
Duplicate values Python Help	12	656	May 28, 2022
Counting Duplicates and Getting Sample Values Python Help help	1	413	July 3, 2022
For Loop Compared to Static Number Python Help	6	451	December 7, 2022
how to replace the duplicate numbers in column " received " in the attached sample with value zero but keeping the last or first value table using Python Python Help help	2	379	August 10, 2022
Switch values of field in dataframe if found Python Help help	10	908	December 5, 2022

Identify duplicate numbers and see if next column have series of numbers

Related Topics