Identify duplicate numbers and see if next column have series of numbers

Hello Python Experts…Please give a code to start. In the dataframe have 2 columns x and y.
Each columns have a series of numbers…I want to check x-column if have 3 duplicate numbers and check all numbers inline to y-column if have in series. And vice-versa…Example as below:
Where x-column have number 4qty of 29 duplicated and equivalanet number in y-column have series 82,83,84…Really appreciate any help…thanks a lot.

x	y
11	82
12	80
18	69
19	61
22	32
24	55
25	34
25	108
26	90
27	99
29	82  *
29	83  *
29	84  *
29	90
30	82
30	83
31	47
32	33
35	85
47	76

Not a pandas wizard, but I put together the following in regular Python:

data = [
    (11,82),
    (12,80),
    (18,69),
    ...
]

for n in range(len(data) - 2):
    if data[n][0] == data[n+1][0] == data[n+2][0]:
        if data[n][1] == data[n + 1][1] - 1 == data[n + 2][1] - 2:
            print(f'Match found at rows {n}-{n + 2}')

1 Like

That one works if the duplicate number is in series…but if not it cannot detect…below is the sample.

x	y
58	27
68	28
38	42
72	49
47	50
19	54
29	62
74	65
75	65
18	70
29	82 <----
30	82
29	83 <----
30	83
29	84 <----
50	94
73	95
57	111
45	111
35	111
37	121
44	127

Hope somone can help…meanwhile, i continue study how to get it done…thanks

Your data set has four 29s and you said:

Yet, you indicate the 29s:

29	82 <----
30	82
29	83 <----
30	83
29	84 <----

This is not a complete working solution, but it may help you out:

data = {
    'x': [58, 68, 38, 72, 47, 19, 29, 74, 75, 18, 29, 30, 29, 30, 29, 50, 73],
    'y': [27, 28, 42, 49, 50, 54, 62, 65, 65, 70, 82, 82, 83, 83, 84, 94, 95]
}

index_dup_x = []


for index, x in enumerate(data['x']):
    for dup_index, dup_x in enumerate(data['x']):
        if index == dup_index:
            pass
        else:
            if dup_x == x:
                index_dup_x.append((index, x))

index_dup_x = list(set(index_dup_x))

for index, y in enumerate(data['y']):
    for item in index_dup_x:
        if index == item[0]:
            print(f"x: {item[1]}\ty:{y}")

[output]

x: 29	y:62
x: 29	y:82
x: 30	y:82
x: 29	y:83
x: 30	y:83
x: 29	y:84

As you can see, it’s also listed the 30s, so you’ll need to modify the code so that you only get the required number of duplicated numbers for each column.

1 Like

Sorry Rob…i have tried to modify but still no success…I will continue trying again and again…thanks

thank you…how to print the data value instead of index? thanks
And pls, ask a favor, pls explain why there’s -2 in below code…thanks

for n in range(len(data) - 2):

Maybe, if you post the code that’s not doing what you want it to do and explain what you’d like it to do (with examples of what the input looks like and and what the output should look like) I can offer some more advice or even write the code for you, but without a better understanding, I’ll be guessing, which makes any coded that a write, useless.

Just to add:

Another way to do this would be to make a dictionary of the indices, for the numbers that you want to locate. You would than know how many there are and where they are. This is a POC for column ‘X’:

data = {
    'x': [58, 68, 38, 72, 47, 19, 29, 74, 75, 18, 29, 30, 29, 30, 29, 50, 73],
    'y': [27, 28, 42, 49, 50, 54, 62, 65, 65, 70, 82, 82, 83, 83, 84, 94, 95]
}

duplicates = {}

for index, x in enumerate(data['x']):
    for dup_index, dup_x in enumerate(data['x']):
        if index == dup_index:
            pass
        else:
            if dup_x == x:
                if x not in duplicates:
                    duplicates[x] = [index]
                else:
                    if index not in duplicates[x]:
                        duplicates[x].append(index)

for dup in duplicates:
    print(
        f"{len(duplicates[dup])} x {dup} found in colunm 'X'. The indices are: {duplicates[dup]}")
1 Like

The check was looking 2 values ahead of the loop index, so the loop would stop before running out of items and throwing an IndexError. If the numbers aren’t sequential, you might want to sort the data into ascending order before doing anything else with it.

1 Like

Hello Rob…good day…Basically i have dataframe including X and Y columns. I want to detect at least 3 duplicates number under X and see its Y values are in sequence (and vice-versa). If duplicate values from X & Y does not pair with sequence then ignore it. Below is the sample of input and output. Thanks in advance.

example duplicate in X

Input	
X	Y
23	11
33	23  <---
65	45
78	43
34	78
33	24  <---
44	45
54	87
23	34
33	22  <---
58	54
34	95
33	23  <---
45	84
	
	
Output	
X	Y
33	22
33	23
33	24


example duplicate in Y

Input	
Y	X
11	23
32	45  <---
45	65
43	78
78	34
78	45  <---
45	44
87	54
34	23
76	45  <---
54	58
95	34
77	45  <---
84	45
	
	
Output	
X	Y
76	45
77	45
78	45

OK, how about this:

from collections import defaultdict
import pandas as pd

df = pd.DataFrame(
    [
        (11, 23),
        (32, 45),
        (45, 65),
        (43, 78),
        (78, 34),
        (78, 45),
        (45, 44),
        (87, 54),
        (34, 23),
        (76, 45),
        (54, 58),
        (95, 34),
        (77, 45),
        (84, 45),
    ], columns=('Y', 'X')
)

entries = defaultdict(list)

for x, y in zip(df.loc[ : , 'X'], df.loc[ : , 'Y']):
    entries[x].append(y)

print('X\tY')

for x, y_list in entries.items():
    y_list.sort()

    for start in range(len(y_list) - 2):
        if y_list[start] == y_list[start + 1] - 1 == y_list[start + 2] - 2:
            for y in y_list[start : start + 3]:
                print(f'{x}\t{y}')
1 Like

No worries.

I’m still not too sure about what you want out of the second example, but maybe the solution is already there in the post from @MRAB.

In any case, this is a modified version of the script from my last post:

data = {
    'x': [23, 33, 65, 78, 34, 33, 44, 54, 23, 33, 58, 34, 33, 45],
    'y': [11, 23, 45, 43, 78, 24, 45, 87, 34, 22, 54, 95, 23, 84]
}

duplicates = {}

for index, x in enumerate(data['x']):
    for dup_index, dup_x in enumerate(data['x']):
        if index == dup_index:
            pass
        else:
            if dup_x == x:
                if x not in duplicates:
                    duplicates[x] = [data['y'][index]]
                else:
                    duplicates[x].append(data['y'][index])

discard = []

for key in duplicates:
    if len(duplicates[key]) < 4:
        discard.append(key)
for key in discard:
    duplicates.pop(key)

for x in duplicates:
    y_values = duplicates[x]
    y_values = list(set(y_values))
    y_values.sort()
    duplicates[x] = y_values

print("x\ty")
for x in duplicates:
    for y in duplicates[x]:
        print(f"{x}\t{y}")

[output]

x	y
33	22
33	23
33	24
1 Like

Thanks Matt…it works…by the way, how about if there’s no detection.
The results like below:

X     Y

How i can make a code to detect if X and Y is empty…I tried len and info and others method but got error. Thanks a lot to you.

For first part using pandas tools:

  • get unique values in X
  • get subframes for every unique value
  • check whether there is at least three rows
import pandas as pd

data = {
    'X': [58, 68, 38, 72, 47, 19, 29, 74, 75, 18, 29, 30, 29, 30, 29, 50, 73],
    'Y': [27, 28, 42, 49, 50, 54, 62, 65, 65, 70, 82, 82, 83, 83, 84, 94, 95]
}

df = pd.DataFrame(data)
uniq = pd.unique(df['X'])

for value in uniq:
     mask = df['X'].values == value   # or simply df['Y'] == value
     subframe = df[mask]
     if 3 <= subframe.index.size:
         print(subframe)

# will output
    X   Y
6   29  62
10  29  82
12  29  83
14  29  84

For me there is some ambiguity how should Y column be checked. Should entire column be increasing? If so pd.Series.is_monotonic_increasing could be used. If not, what are the exact requirements/conditions for increase?

1 Like

Inside of printing the results in the loop, just store them in a list and then print them out afterwards.