When you need to find out if something is in some collection of items
such as a list, how slow (expensive) that is depends on the nature of
the collection.
If you have a list which is in no particular order you would need to
examine each item in the list to see if it matched. So the expense, on
average, would be proportional to the length of the list. We call this
cost O(N), where N represents the length of the list. Big-O notation
means the order (scale) of the expense as a function of some measure of
the problem - here the measure is the length of the list.
See: Big O notation - Wikipedia for detail and further
references.
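For illustration, here's a minimal sketch of that linear scan in Python
(the function name linear_contains is made up for this example; the
built-in "item in somelist" test does effectively the same thing):

    def linear_contains(items, target):
        # Examine each item in turn: O(N) comparisons on average.
        for item in items:
            if item == target:
                return True
        return False

    print(linear_contains([3, 1, 4, 1, 5], 4))  # True
    print(linear_contains([3, 1, 4, 1, 5], 9))  # False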
If the list is known to be in order, for example ascending order, you
have a couple of choices to take advantage of this:
- search until you find a match or the items exceed your target value -
at that point you know there's no point in searching further; that is
still O(N), but better by a constant factor - about N/2 comparisons
on average (see the sketch after this list)
- do a bisect search
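Here's a sketch of the first choice, the early-exit scan
(sorted_contains_scan is a made-up name for this example; it assumes
the list is in ascending order):

    def sorted_contains_scan(sorted_items, target):
        # Still O(N), but stops early - about N/2 comparisons
        # on average for targets scattered through the range.
        for item in sorted_items:
            if item == target:
                return True
            if item > target:
                return False  # gone past where target would be
        return False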
A bisect search keeps two bounds, initially the start and end of the
list, and examines the entry half way between them. If your target is
less than that entry, the midpoint becomes the new upper bound; if it
is more, the midpoint becomes the new lower bound. Then compare against
the middle of the new, halved, range and repeat.
In this way the boundaries get closer together until you find a match
or the boundaries meet (no match). Because each iteration halves the
size of the range, you need about O(log2(N)) iterations. This is much
faster than the linear search for larger lists. For a tiny list like
your 3-element examples the mucking about exceeds the cost of just
scanning the list, but in the real world, with lists of any significant
size, the bisect approach is a win.
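Python's standard library bisect module does the halving for you; a
small sketch (bisect_contains is an illustrative name for this example):

    from bisect import bisect_left

    def bisect_contains(sorted_items, target):
        # bisect_left finds the leftmost position where target
        # could be inserted, using O(log2(N)) probes.
        i = bisect_left(sorted_items, target)
        return i < len(sorted_items) and sorted_items[i] == target

    print(bisect_contains([1, 3, 5, 7, 9], 7))  # True
    print(bisect_contains([1, 3, 5, 7, 9], 4))  # False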
A set is better still. Internally it uses a data structure called a
hash table: Hash table - Wikipedia
A hash table is essentially a list, each of whose entries is a small
list of items. You decide where an item goes in the list with a hash
function whose purpose is to (on average) distribute the items evenly
across the list, and the hash function always produces the same value
for a given item. You choose the size of the list based on the number of
values you're storing in it. The idea is that on average each entry
in the list has very few items stored in it, often 0 or 1.
Then, when you go to see if some test item is in the table, you compute
the test item’s hash value, and use that to pick which entry would
hold this item if it were in the list. Then you only need to see if the
item occurs in the very short sublist stored in the entry, and can
ignore all the other entries completely.
By sizing the table to have very few items in any entry, the cost of
looking things up becomes the cost of searching a 0 or 1 element list,
which is basically a constant small time. We call this O(1) in big-O
notation.
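To make that concrete, here is a deliberately simple toy hash table -
an illustration of the idea above, not how Python's set is actually
implemented:

    class ToyHashTable:
        def __init__(self, nbuckets=8):
            # The outer list; each entry is a small list ("bucket").
            self.buckets = [[] for _ in range(nbuckets)]

        def _bucket_for(self, item):
            # hash() always returns the same value for a given item;
            # the modulo spreads items across the buckets.
            return self.buckets[hash(item) % len(self.buckets)]

        def add(self, item):
            bucket = self._bucket_for(item)
            if item not in bucket:
                bucket.append(item)

        def __contains__(self, item):
            # Search only the one short bucket: O(1) on average.
            return item in self._bucket_for(item)

    t = ToyHashTable()
    t.add(3)
    print(3 in t)  # True
    print(9 in t)  # False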
This is amazingly effective. Of course it isn’t free: you need to do
work to create and populate the list using the hash function etc. But
once made, lookups are very fast.
A Python set uses a hash table to index its elements, and thus has
O(1) lookup time (the cost of a single membership test).
A Python dict, which maps keys to values, also uses a hash table for
its lookups.
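For example (the keys and values here are arbitrary):

    d = {"apple": 3, "banana": 5}
    print("apple" in d)  # key membership: an O(1) hash lookup
    print(d["banana"])   # fetching the value uses the same lookup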
When you make a set, eg:
s = set((1, 2, 3, 4, 2, 6, 7))
Python does all this work for you, and then you can test:
item in s
very cheaply.
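If you want to see the difference for yourself, the standard timeit
module can time the two membership tests; this sketch uses a
100000-element list and the equivalent set:

    from timeit import timeit

    items = list(range(100000))
    s = set(items)

    # Target at the far end: the list must scan all N items...
    print(timeit("99999 in items", globals=globals(), number=100))
    # ...while the set does a single O(1) hash lookup.
    print(timeit("99999 in s", globals=globals(), number=100))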