Merging files by data match

I have two files with the following content:

a.txt

path: /some/location/1log.txt
item1

path: /some/location/2log.txt
item2

b.txt

item1
value $5

item2
value $10

and I want to merge them into a single file like this:

c.txt

path: /some/location/1log.txt
item1
value $5

path: /some/location/2log.txt
item2
value $10

Seems straightforward. What code have you written so far?

Some significant information which appears to be missing:

  • what if there is no match for an a-file entry in file-b?
  • what if there is no entry in file-a but there is in file-b?
  • are the files guaranteed to be in the same sequence by item-number?
  • how large are the files, ie could (at least one) be kept in-memory?

I’d parse one of the files, making a dict where the key is the item and the value is the info, and then parse the other file, looking up the additional info in the dict using the item.
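
For example, the first half of that idea might look like this (a minimal sketch only, assuming the simple two-line records shown above: an item line followed by its value line):

# Build a lookup table from b.txt: item -> the lines that belong to it.
item_info = {}

with open("b.txt") as file_b:
    current_item = None
    for line in file_b:
        line = line.strip()
        if not line:
            current_item = None              # blank line ends the record
        elif current_item is None:
            current_item = line              # first line of a record is the item
            item_info[current_item] = []
        else:
            item_info[current_item].append(line)

a.txt would then be read in the same way, writing each path followed by item_info[item] whenever the item is present in the dict.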

I'll answer your questions to clarify the task.

  • if there is no match for an a-file entry in file-b, then skip it. So print only matched entries to c.
  • if there is no entry in file-a but there is in file-b, then also skip it.
  • item numbers may not be in sequence. Also some other info may be prepended.
    E.g. lines in b may have such entries:
description
item4
value $6
some_other_info

description
item1
value $5
yet_another_info

description
item2
value $8
more_another_info
  • in my real task these files could be up to 100 MB in size. It's quite fine to keep the full content in memory.

Then desired output should be:

path: /some/location/1log.txt
description
item1
value $5
yet_another_info

path: /some/location/2log.txt
description
item2
value $8
more_another_info

So far I have tried this approach, which is not giving the desired output.

# Dictionary to store the merged content
merged_content = {}

# Reading a.txt
with open("a.txt", "r") as file_a:
    lines_a = file_a.readlines()

# Reading b.txt
with open("b.txt", "r") as file_b:
    lines_b = file_b.readlines()

# Variables to keep track of the current path
current_path = None

# Iterate through the lines of both files
for line in lines_a + lines_b:
    line = line.strip()

    if line.startswith("path:"):
        current_path = line
        merged_content[current_path] = []
    elif current_path is not None:
        merged_content[current_path].append(line)

# Write the merged content to c.txt
with open("c.txt", "w") as file_c:
    for path, items in merged_content.items():
        file_c.write(path + "\n")
        for item in items:
            file_c.write(item + "\n")

Your dict merged_content tries to link the entries of the 2 files by the path: parts, but only a.txt has those.

As I suggested previously, you should be linking them by the item part.
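
For instance (a rough sketch only, keying b.txt's blank-line-separated records by their item line and then walking a.txt, so that only matched items end up in c.txt):

# Key each record of b.txt by the line that starts with "item".
records_by_item = {}
with open("b.txt") as file_b:
    for record in file_b.read().split("\n\n"):
        for rec_line in record.strip().splitlines():
            if rec_line.startswith("item"):
                records_by_item[rec_line] = record.strip()
                break

# Walk a.txt, remembering the current path and resolving items via the dict.
current_path = None
with open("a.txt") as file_a, open("c.txt", "w") as file_c:
    for line in file_a:
        line = line.strip()
        if line.startswith("path:"):
            current_path = line
        elif current_path and line in records_by_item:
            file_c.write(current_path + "\n")
            file_c.write(records_by_item[line] + "\n\n")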

Completely agree with @MRAB’s responses.

Firstly ask: what is the “key” which links the entries in the two files? (not path!)

Secondly, (ref earlier Qu about file-size) read the smaller file and (start to) populate the merged_content dictionary, line-by-line. Given that the data appears to have a key:value format, remember that it is possible to “nest” dictionaries. Thus, each line/row/record in merged_content could consist of a dict, eg

{ "item1": { "path": "etc/1log.text", "desc": "description", "value": 5.00, ... },
  "item2": { ... },
  ...
}

Yes, in some cases there may not be a label, eg the description, but forming such a structure is likely to help in designing this collection-phase and whatever comes next!

Thirdly, having built a basic data-structure, read through the longer file. Follow the rules outlined above and ignore the record if the input doesn’t match. If it does match, add the data to the appropriate entry (nested-dict) in merged_content.

Lastly, create file c.txt by iterating through merged_content - but check that elements from both file-a and file-b appear, eg (apparently defined as: the inner-dict has keys for both path and value) and discard/ignore irrelevant entries.

By splitting-up the task into functional-elements (yes, they could be coded as individual functions), each step can be tested in-isolation. Thus, it becomes obvious where any fault lies. (not that we make mistakes!)
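
For illustration, those functional elements might be laid out something like this (an outline only; the names are invented and the bodies are deliberately left as stubs, so that each piece can be written and tested in isolation before being wired together):

def read_paths(filename):
    """Return {item: path} parsed from the a-file."""
    ...

def read_records(filename):
    """Return {item: block_of_lines} parsed from the b-file."""
    ...

def merge(paths, records):
    """Return only the entries whose item appears in both inputs."""
    ...

def write_output(filename, merged):
    """Write one path plus its record block per matched item."""
    ...

# Once each function works on its own:
# write_output("c.txt", merge(read_paths("a.txt"), read_records("b.txt")))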

Once the system is working (correctly), yes, it will become apparent that some steps could be combined. However, KISS-principle applies, or in IT-philosophy “make it work, before you make it better” and “premature optimisation is the root of all evil”…

NB if the files were both sorted by key (itemNR) then the job could be done in a single “pass”. That was the way we performed a lot of data-processing back in the ?good, old, mainframe days - hence that question. So, when you’ve finished tinkering with the code; consider the question of data-formatting (if the data were otherwise organised, ie “designed” would it make the coding easier?): might it be quicker to sort both files first, and then merge them?
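
For the curious, that single-pass idea is the classic merge-join; a sketch only, assuming each file has already been reduced to a list of (item, data) pairs sorted by item:

def merge_sorted(a_pairs, b_pairs):
    """Walk two item-sorted lists once, yielding only the matched items."""
    i = j = 0
    while i < len(a_pairs) and j < len(b_pairs):
        a_item, a_data = a_pairs[i]
        b_item, b_data = b_pairs[j]
        if a_item == b_item:
            yield a_item, a_data, b_data
            i += 1
            j += 1
        elif a_item < b_item:
            i += 1          # no match in file-b for this a-entry: skip it
        else:
            j += 1          # no match in file-a for this b-entry: skip it

Each input is consumed exactly once, which is why sorting first can pay off for very large files.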

Thus, some up-front thinking and design may save coding-time/complexity! OTOH once the coding-job has been done…

Alright, after digging into the real task I found that nested dictionaries won't work in my case, because the data blocks in b.txt may contain different names and different numbers of lines; only the searched itemNR stays unique and the same. That's why I needed to split the data into blocks whose boundaries are empty lines. GNU awk has a simple one-liner that finds a matched pattern and prints everything around it up to the blank lines on either side:

awk 'BEGIN{RS=""} /searched_string/' file

and in Python it's even simpler, e.g. re.split(r'\n\n', content).
And here is working code which gives the desired output:

import re

# Function to search for a string in a record
def search_string_in_record(record, search_string, path):
    if re.search(search_string, record):
        print(f'path: {path}')
        print(f'Search string "{search_string}" found in the record:')
        print(record)
        print("\n")

# Read search strings and paths from 'a.txt'
search_strings = []
paths = []
with open('a.txt', 'r') as search_file:
    current_path = None
    for line in search_file:
        line = line.strip()
        if not line:
            continue # Skip blank lines
            #current_path = None
        if line.startswith("path: "):
            current_path = line.replace("path: ", "")
        elif current_path:
            paths.append(current_path)
            search_strings.append(line)

# File to search within
target_file = 'b.txt'

# Search for each search string in the target file's records
with open(target_file, 'r') as file:
    content = file.read()
    records = re.split(r'\n\n', content)

for path, search_string in zip(paths, search_strings):
    for record in records:
        search_string_in_record(record, search_string, path)

%python3 tmp.py

path: /some/location/1log.txt
Search string "item1" found in the record:
description
item1
value $5
yet_another_info


path: /some/location/2log.txt
Search string "item2" found in the record:
description
item2
value $8
more_another_info

However I found one glitch. If a.txt contains multiple items after the "path:" line, the script will print the same "path:" again for each item belonging to that path.

Example:

a.txt
path: /some/location/1log.txt
item1
item4

path: /some/location/2log.txt
item2

b.txt
description
item4
value $6
some_other_info

description
item1
value $5
yet_another_info

description
item2
value $8
more_another_info

description
item8
value $59
info_another_info

The script duplicates "path: /some/location/1log.txt":

path: /some/location/1log.txt
Search string "item1" found in the record:
description
item1
value $5
yet_another_info


path: /some/location/1log.txt
Search string "item4" found in the record:
description
item4
value $6
some_other_info


path: /some/location/2log.txt
Search string "item2" found in the record:
description
item2
value $8
more_another_info


So yeah @d_n, I made it just work; now it needs a bit of improvement )
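
One possible tweak for that (a sketch only, carrying on from the script above and re-using its paths, search_strings and records variables): group the items by path first, then print each path heading once, followed by all of its matched records.

from collections import defaultdict

# Group the parallel lists so each path owns all of its items.
items_by_path = defaultdict(list)
for path, search_string in zip(paths, search_strings):
    items_by_path[path].append(search_string)

# Print each path once, then every record that matches one of its items.
for path, items in items_by_path.items():
    matched = [record for record in records
               if any(re.search(item, record) for item in items)]
    if matched:
        print(f'path: {path}')
        print("\n\n".join(matched))
        print()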

Here’s my own take on it:

paths = {}

with open('a.txt') as in_file:
    for line in in_file:
        if line.startswith('path'):
            current_path = line
        elif line.startswith('item'):
            paths[line] = current_path

with open('b.txt') as in_file:
    with open('c.txt', 'w') as out_file:
        for line in in_file:
            out_file.write(line)

            if line.startswith('item') and line in paths:
                out_file.write(paths[line])

It works to your satisfaction. That’s good!

May I refer you to the last section of the previous response which discussed the need to (completely) analyse the data, ie understand the exact problem(s) to be solved! Not only did you need to add ‘rules’ AFTER (some) code had been written, but it relegated some of the answers-provided into irrelevance. Is this respectful use of your time? …our time? …clock time?

Speaking of specification-changes, is the output-file no longer required?

Assuming this is not a one-off data-conversion exercise, such may also suggest that the processing which produces these files could also do with a bit of data-analysis/application of sanity…

If the data-file(s) are considered to provide data in ‘blocks’, where each block may consist of a number of lines (per the “glitch”), then a common ComSc approach is State Transition. In this case, once a path is found, every line must be copied and kept ‘with’ that path - until the next blank line is input. This deals with the problem of an unpredictable number of lines per ‘block’. Might this also inform/take care of the duplication-issue? Regardless, a useful tool for one’s tool-box!
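
A rough sketch of that idea applied to a.txt (hedged; the single piece of state is "which path's block am I inside?", and a blank line resets it):

items_by_path = {}
current_path = None                      # state: the block we are currently inside

with open("a.txt") as file_a:
    for line in file_a:
        line = line.strip()
        if not line:
            current_path = None          # blank line: transition back to "outside a block"
        elif line.startswith("path: "):
            current_path = line[len("path: "):]
            items_by_path[current_path] = []
        elif current_path is not None:
            items_by_path[current_path].append(line)    # any other line stays with this path

Collected that way, the output stage can emit each path exactly once, which also takes care of the duplication described earlier.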

Also, note that unlike earlier processing, with file-B there is no stripping of white space. Could/should this concern/treatment be applied to both input-files, ie a multi-use function?

Is search_string_in_record descriptive of the tasks being carried-out? My bias is to be wary of RegEx (YMMV!). The RegEx only seeks to advise if the record contains the search-string. Why not simplify to in?
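
For example (an illustrative rewrite of that helper; the new name is just a suggestion, and a plain substring test replaces re.search):

def print_record_if_match(record, search_string, path):
    # Plain substring membership; no regular-expression machinery needed.
    if search_string in record:
        print(f'path: {path}')
        print(record)
        print()

One caveat either way: a substring test, like the original regex, would also find item1 inside item10; comparing against record.splitlines() gives an exact-line match if that matters.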