Find the intersection of one list and one dataframe

df = pd.DataFrame({'a': [7,8,9], 'b':[4,5,6]})
y = [11,1,6]

I want to find the intersection of df['b'] and y, but this

[x for x in y if x in df['b']]

gives me

[1]

but I expect

[6]

In order to get [6], I need

[x for x in y if x in list(df['b'])]

Why doesn’t my first approach give me [6]?

Inspecting in the REPL gives us a hint of what’s happening:

>>> df = pd.DataFrame({'a': [7,8,9], 'b':[4,5,6]})
>>> df['b']
0    4
1    5
2    6
Name: b, dtype: int64
>>> 0 in df['b']
True
>>> 1 in df['b']
True
>>> 2 in df['b']
True
>>> 3 in df['b']
False
>>> 4 in df['b']
False
>>> 5 in df['b']
False

The condition x in df['b'] tests whether x is one of the elements of the index of df['b'] (which you can access directly via df['b'].index). When you write list(df['b']), you are transforming the pandas series df['b'] into a Python list, which has no concept of “index”, so x in some_list really tests whether x is one of the elements of some_list.
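To make the difference concrete, here is a minimal sketch that reuses the df and y from the question. The variable names (s, by_index, by_values, col_in_y) are only illustrative. It compares the index-based test with two value-based alternatives, the second being the Series.isin method:

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 8, 9], 'b': [4, 5, 6]})
y = [11, 1, 6]
s = df['b']

# `in` on a Series checks the index labels (0, 1, 2), not the stored values.
by_index = [x for x in y if x in s]

# Testing against the underlying values gives the expected result.
by_values = [x for x in y if x in s.values]

# Series.isin performs the value test element-wise and returns a boolean mask.
col_in_y = s[s.isin(y)].tolist()

print(by_index)   # [1]
print(by_values)  # [6]
print(col_in_y)   # [6]
```

Note that 1 shows up in the first result only because 1 happens to be an index label, not because it is stored in the column.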

In order to get intersections in a straightforward way, you can use sets:

>>> set(df['b']).intersection([11, 1, 6])
{6}
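One thing to keep in mind is that a set drops duplicates and does not preserve order. If the order or the duplicates in y matter, a possible variation is to use the set only for the membership test. This is a sketch; the extra 6 in y and the names b_set, common and unique_common are assumptions added for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 8, 9], 'b': [4, 5, 6]})
y = [11, 1, 6, 6]  # a duplicate 6 added for illustration

# Build the set once so each membership test is a fast hash lookup.
b_set = set(df['b'])

# Keeps the order and the duplicates of y.
common = [x for x in y if x in b_set]

# A plain set intersection keeps only the unique values.
unique_common = b_set.intersection(y)

print(common)         # [6, 6]
print(unique_common)  # {6}
```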

Hey John, @fonini has explained it clearly: it's all about data types. The values in the DataFrame column are stored in a pandas Series, while a plain list can hold different data types. I was playing around with this and found that, with the list method, the string "6" and the int 6 are compared as different values, so only the int 6 matches.
The set method limits the result to the unique values in the intersection.
Anyway, I also found that NumPy already has a function for this, np.intersect1d, which would be nicer for large datasets:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [7, 8, 9, 8], 'b': [4, 5, 6, 9]})
y = [11, 9, '6', 6]

# Convert the 'b' column to a numpy array
b_values = df['b'].values

intersection = np.intersect1d(b_values, y)   # note: y mixes int and str, so NumPy coerces it to a single dtype before comparing

print(intersection)

print([x for x in y if x in df['b'].to_list()])  # plain Python comparison: the string '6' does not match the int 6

print(set(df['b']).intersection([11, 1, 6]))
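Since the mixed-type list is the tricky part here, this is a small sketch of how the types behave. It uses a plain list for b_values rather than the DataFrame column to stay self-contained, and the names y_array, y_ints and common are only illustrative. Filtering y down to ints first is one possible way to keep the comparison numeric, not the only one:

```python
import numpy as np

b_values = [4, 5, 6, 9]
y = [11, 9, '6', 6]

# Plain Python never treats the string '6' as equal to the int 6.
print('6' == 6)                          # False
print([x for x in y if x in b_values])   # [9, 6]

# NumPy converts a mixed list to one common dtype; here it becomes a string dtype.
y_array = np.asarray(y)
print(y_array.dtype.kind)                # U

# Keeping only the ints makes the intersection a purely numeric comparison.
y_ints = [x for x in y if isinstance(x, int)]
common = np.intersect1d(b_values, y_ints)
print(common.tolist())                   # [6, 9]
```

np.intersect1d returns the sorted unique common values, so the order of y is not preserved.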