Merging a sublist if partial match exists

cheesebird · June 19, 2022, 9:22am

I’m not sure if this is even possible, but I want to merge sublists when there is a partial match and then remove the merged sublists from the main list.

s = [['X1: Cheese' ,' 99','98','97'], ['77','78','76'],['X1: Cheese','99','98','97','10','11'], ['X1: Cheese ','99','98','97','22','21'],['X1: NoCheese', '99','98','97']]

Output…

[['X1: Cheese' ,' 99','98','97','10','11','22','21'], ['77','78','76'], ['X1: NoCheese', '99','98','97']]

So I’d be matching the ‘X1: Cheese’ and merging the values that don’t exist in the first sublist found.

The values here are all variables so can’t be searched simply by the word ‘Cheese’, they would have to be matched dynamically.

Not even sure how to approach this

cheesebird · June 19, 2022, 11:17am

I played around with this…

from itertools import groupby
import itertools
sz = [['X1: Cheese' ,' 99','98','97'], ['77','78','76'],['X1: Cheese','99','98','97','10','11'], ['X1: Cheese','99','98','97','22','21'],['X1: NoCheese', '99','98','97']]
tup = list(map(tuple,sz))
fixed_list = [[idx]+list(itertools.chain.from_iterable(list(filter(None, i[1:])) for i in e)) for idx, e in groupby(tup, lambda x: x[0])]
print(fixed_list)

Output

[['X1: Cheese', ' 99', '98', '97'], ['77', '78', '76'], ['X1: Cheese', '99', '98', '97', '10', '11', '99', '98', '97', '22', '21'], ['X1: NoCheese', '99', '98', '97']]

But I would expect the output to be…

[['X1: Cheese', '99', '98', '97', '10', '11', '99', '98', '97', '22', '21'], ['77', '78', '76'], ['X1: NoCheese', '99', '98', '97']]

i.e. I would expect this group to be removed as well:
['X1: Cheese', ' 99', '98', '97']

Not sure what I’m doing (wrong)

rob42 · June 19, 2022, 11:25am

Is there a use case for this, or is it simply an academic exercise?

cheesebird · June 19, 2022, 11:54am

Yes eventually will be a use case for the Astronomy club

rob42 · June 19, 2022, 12:55pm

Could you provide an example of your use case?

I’m not saying that I’ll be able to help (maybe yes, maybe no) but I’ve trouble with abstract concepts and your code thus far seems (to my mind, least ways) to fall into the abstract {others here may disagree with this and are clearly free to do so}.

mlgtechuser · June 19, 2022, 1:07pm

Ross, I’m not following what you’re after, so I reflowed the s = [] list to be able to see it all at once.
It looks like you need to consolidate the 'X1:' and the numbers and filter duplicates. Is that right?

#INPUT
s = [['X1: Cheese' ,' 99','98','97'], ['77','78','76']
    ,['X1: Cheese','99','98','97','10','11']
    ,['X1: Cheese ','99','98','97','22','21']
    ,['X1: NoCheese', '99','98','97']]
#OUTPUT
[['X1: Cheese' ,' 99','98','97','10','11','22','21']
,['77','78','76'],['X1: NoCheese', '99','98','97']]

We need to know some hard rules (this is the “system specification” that we discussed at length in the original puzzle you brought us). For example, can the variables be resolved positionally? That is, will the first list member always be ‘X1:<something>’ or a number?

If you present “Here’s my input” and “here’s the output I need”…“except that my input changes in some unspecified ways”, then that’s an unsolvable situation.

cheesebird · June 19, 2022, 1:51pm

Ok understood.

I’m trying to match on the first element of each sublist, so that multiple occurrences of the same ‘X1: ****’ value are found. That value will always be the first element of its sublist.
The example sublist ['77', '78', '76'] was only put in the list to show that it is not considered.
So for your nicely reformatted example…

#INPUT
s = [['X1: Cheese' ,' 99','98','97'], ['77','78','76']
    ,['X1: Cheese','99','98','97','10','11']
    ,['X1: Cheese ','99','98','97','22','21']
    ,['X1: NoCheese', '99','98','97']]
#OUTPUT
[['X1: Cheese' ,' 99','98','97','10','11','22','21']
,['77','78','76'],['X1: NoCheese', '99','98','97']]

We see that there are 3 matching first elements, ‘X1: Cheese’.
From here I would like to merge these matched sublists into 1 sublist.
So it would look like this…

['X1: Cheese' ,' 99','98','97','10','11','22','21']

So it contains all the elements from the matched sublist including the ones that were already present.
The other sublists, which weren’t matched by the first element, remain intact.

Edit…
If I sort the list…
sz.sort()
I get the correct output…

[['77', '78', '76'],
 ['X1: Cheese', ' 99', '98', '97', '99', '98', '97', '10', '11', '99', '98', '97', '22', '21'],
 ['X1: NoCheese', '99', '98', '97']]

So why doesn’t it work in an unsorted list?

vbrozik · June 19, 2022, 2:43pm

It is simply how itertools.groupby() works:

… It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). …
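A minimal sketch of that behaviour, using a made-up list of keys (not the data from this thread): the same key ends up in two groups when its occurrences are not adjacent, and in one group once the input is sorted.

```python
from itertools import groupby

# Hypothetical keys, chosen only to show the effect of ordering.
keys = ['a', 'b', 'a', 'a']

# groupby() starts a new group every time the key changes,
# so the first 'a' is separated from the later ones.
unsorted_groups = [(key, len(list(group))) for key, group in groupby(keys)]

# After sorting, equal keys are adjacent and fall into one group.
sorted_groups = [(key, len(list(group))) for key, group in groupby(sorted(keys))]

print(unsorted_groups)  # [('a', 1), ('b', 1), ('a', 2)]
print(sorted_groups)    # [('a', 3), ('b', 1)]
```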

cheesebird · June 19, 2022, 2:59pm

I see. Are there any alternative methods that will allow the list to remain sorted and merge the required sublists?

vbrozik · June 19, 2022, 3:33pm

To be precise: I guess you mean the list to keep its original sort order.

I think the easiest method would be to:

1. Add the original index to the items (e.g. as a last member in the list, or encapsulate the items in tuples (item, index)).
2. Perform the grouping as you do, modifying the algorithm to take into account that you added the index.
3. Re-sort the results according to the original index.
4. Remove the original index.

cheesebird · June 19, 2022, 3:40pm

Correct

cheesebird · June 19, 2022, 3:41pm

Not particularly easy for me

Do you have some code examples how this could be achieved?

vbrozik · June 19, 2022, 3:57pm

Just the first step:

s = [['X1: Cheese' ,' 99','98','97'], ['77','78','76'],['X1: Cheese','99','98','97','10','11'], ['X1: Cheese ','99','98','97','22','21'],['X1: NoCheese', '99','98','97']]
list(enumerate(s))

[(0, ['X1: Cheese', ' 99', '98', '97']),
 (1, ['77', '78', '76']),
 (2, ['X1: Cheese', '99', '98', '97', '10', '11']),
 (3, ['X1: Cheese ', '99', '98', '97', '22', '21']),
 (4, ['X1: NoCheese', '99', '98', '97'])]

The rest is upon you or someone else.

cheesebird · June 19, 2022, 5:34pm

Thanks.

Could anyone point out where I’m going wrong?

from itertools import groupby
import itertools
sz = [['X1: Cheese' ,' 99','98','97'], ['77','78','76'],['X1: Cheese','99','98','97','10','11'], ['X1: Cheese','99','98','97','22','21'],['X1: NoCheese', '99','98','97']]
sz = [enum_item for enum_item in enumerate(sz)]
sz = sorted(sz, key=lambda x:x[1])
print(sz)
tup = list(map(tuple,sz))
fixed_list = [[idx]+list(itertools.chain.from_iterable(list(filter(None, i[1:])) for i in e)) for idx, e in groupby(tup, lambda x: x[0])]
print(fixed_list)

This outputs erroneously…

[[1, ['77', '78', '76']], [0, ['X1: Cheese', ' 99', '98', '97']], [2, ['X1: Cheese', '99', '98', '97', '10', '11']], [3, ['X1: Cheese', '99', '98', '97', '22', '21']], [4, ['X1: NoCheese', '99', '98', '97']]]

vbrozik · June 19, 2022, 8:24pm

The first mistake was that you grouped by x[0] instead of x[1][0], forgetting that the structure has changed. Beyond that point it was easier to write new code than to analyze yours.

Input data:

sz = [
    ['X1: Cheese' ,' 99','98','97'],
    ['77','78','76'],
    ['X1: Cheese','99','98','97','10','11'],
    ['X1: Cheese','99','98','97','22','21'],
    ['X1: NoCheese', '99','98','97']]

The code:

import itertools

def sz_grouping_key(item):
    return item[1][0]

sz_group_sorted = sorted(enumerate(sz), key=sz_grouping_key)   # step 1

sz_grouped = itertools.groupby(sz_group_sorted, sz_grouping_key)

sz_unsorted_result = []
for first_item, group_list in sz_grouped:
    original_indexes = []
    inner_items = set()   # items should not repeat -> use set, note: ordering lost!
    for original_index, group_item in group_list:
        original_indexes.append(original_index)
        inner_items.update(group_item[1:])  # we have first_item, skip it
    sz_unsorted_result.append(
            (min(original_indexes), [first_item] + list(inner_items)))

sz_processed = [item[1] for item in sorted(sz_unsorted_result)]  # steps 3 and 4

Output:

[['X1: Cheese', '99', '11', '10', '97', '22', '98', ' 99', '21'],
 ['77', '78', '76'],
 ['X1: NoCheese', '99', '97', '98']]

Notes:

'77' is interpreted as a normal grouping key. If you do not want this behaviour, either insert some special value at the beginning of such lists or change the grouping function sz_grouping_key().
Collecting items to inner_items = set() does not keep their order. If this is undesirable, change the way of collecting the items. The first_item is added separately.
Do not write long expressions. Split the code to smaller steps and use descriptive variable names. Otherwise no-one will understand the code.
Code split to smaller parts is also much easier to troubleshoot. You can analyze the code by inserting diagnostic print() commands.
Earlier I wrote convoluted [enum_item for enum_item in enumerate(s)] instead of simple list(enumerate(s)) - going to fix this.
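If the order of the collected items matters, one possible alternative is sketched below. It swaps the sort + groupby + set approach for a plain dict, which keeps insertion order in Python 3.7+. The function name merge_by_first_item is made up for this illustration, and the sketch assumes every sublist has at least one element.

```python
sz = [
    ['X1: Cheese', ' 99', '98', '97'],
    ['77', '78', '76'],
    ['X1: Cheese', '99', '98', '97', '10', '11'],
    ['X1: Cheese', '99', '98', '97', '22', '21'],
    ['X1: NoCheese', '99', '98', '97'],
]

def merge_by_first_item(sublists):
    """Merge sublists sharing the same first item, keeping first-seen order.

    Hypothetical helper; assumes every sublist is non-empty.
    """
    merged = {}  # first item -> values collected so far, in first-seen order
    for sublist in sublists:
        key, *values = sublist
        collected = merged.setdefault(key, [])
        for value in values:
            if value not in collected:  # skip values already collected
                collected.append(value)
    return [[key] + values for key, values in merged.items()]

print(merge_by_first_item(sz))
```

Note that ' 99' and '99' are different strings, so both survive the merge, just as in the set-based version.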

cheesebird · June 20, 2022, 6:05am

Agreed thanks for the tips and the code works perfectly. Many thanks

mlgtechuser · June 23, 2022, 3:03pm

Here’s an old-school approach with looping and parsing. It initializes a list and builds it up as an XY table. The steps are:

1. Loop through each element in the ‘s’ data list to find unique (or repeated) sublists.
2. Assign the sublist to a column.
   - If new item: new column.
   - If repeated item containing ‘:’: add to the column assigned to that item.
3. Parse the columns one at a time to pick up unique sublist items.
   - append() the unique items to a mergedRow list.
   - append() the mergedRow list to a mergedTable list as the final output.

It was very good at finding the random space characters sprinkled through the sample s data.
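If those stray spaces are unintentional (an assumption, the thread does not say), one possible clean-up is to strip every string before matching. The helper name strip_all is made up for this sketch.

```python
def strip_all(sublist):
    """Return a copy of the sublist with surrounding whitespace removed.

    Hypothetical helper; assumes every element is a string.
    """
    return [text.strip() for text in sublist]

# Two sublists from the sample data that differ only by stray spaces.
raw = [
    ['X1: Cheese', ' 99', '98', '97'],
    ['X1: Cheese ', '99', '98', '97', '22', '21'],
]

cleaned = [strip_all(sublist) for sublist in raw]
print(cleaned)
```

After this step 'X1: Cheese ' and 'X1: Cheese' compare equal, and ' 99' is no longer kept alongside '99'.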

I recommend tucking the two sections away as findUniques() and combineUniques() functions to keep the main code clean.

I’m sure I worked on this way longer than Vàclav, but it just crunches the data and follows the specs of our resident astronomer without any quirks except (possibly) the dependence on ‘:’ in the data. This is reportedly a reliable search argument, though.

OUTPUT:

['X1: Cheese', ' 99', '98', '97', '99', '10', '11', '22', '21']
['77', '78', '76']
['X1: NoCheese', '99', '98', '97']

from copy import deepcopy as dcopy    #dcopy() is used in the merge section below

s = [['X1: Cheese' ,' 99','98','97']
    ,['77','78','76']
    ,['X1: Cheese','99','98','97','10','11']
    ,['X1: Cheese','99','98','97','22','21']
    ,['X1: NoCheese', '99','98','97']]

table = []
for item in s:
    if len(table) == 0:                                     #seed the table or append(blank_row) will fail
        table.append([item])
        continue
    table.append([])                                        #start a blank row
    for colNum in range(len(table[0])): table[-1].append([])   #pad the new row with one empty cell per existing column
    if item[0].find(":")+1:                                 #first element of 'item' contains ':'
        if item[0] in [element
            for row in range(len(table)-2)
                for col in range(len(table[0]))
                    for element in table[row][col]]:                #repeated ':' item
            for row in table:
                if item in table[-1]: break                             #'item' added to last row; break out of loop
                for colNum in range(len(row)):
                    if row[colNum][0] == item[0]:
                        table[len(table)-1][colNum] = item              #assign to a column in last row
                        break
                    else: print("Didn't find a column")
        else:                                               #new ':' item
            for row in range(len(table)): table[row].append([]) #add a new column
            table[len(table)-1][len(table[0])-1] = item                 #assign to bottom right corner
    else:                                                   #non ':' item
        for row in range(len(table)): table[row].append([]) #add a new column
        table[len(table)-1][len(table[0])-1] = item                     #assign to bottom right corner
for row in table:
    print(row)
print()

mergedRow = []
mergedTable = []
for colNum,_ in enumerate(table[0]):     #index through columns in table
    for row in table:
        for item in row[colNum]:
            if item not in mergedRow:
                mergedRow.append(item)
    mergedTable.append(dcopy(mergedRow))
    mergedRow.clear()
for row in mergedTable:
    print(row)

cheesebird · June 24, 2022, 6:10am

@mlgtechuser

Nice code man, works really well.