Handling elements in a file

PJh · January 26, 2023, 6:50pm

My data is a text file, as shown below:

a:b,0.3333333333333333
b:a,0.3333333333333333
a:b,0.5
b:a,0.5
a:b,0.3333333333333333
a:b,0.3333333333333333
b:c,0.1111111111111111
b:c,0.2
b:c,0.14285714285714285
b:c,0.125
b:c,0.14285714285714285
b:c,0.125
b:c,0.25
b:c,0.16666666666666666
b:c,0.14285714285714285
b:c,0.16666666666666666
c:b,0.1111111111111111
c:b,0.1111111111111111
c:b,0.1111111111111111
c:b,0.1111111111111111
c:b,0.1111111111111111
b:d,0.14285714285714285
b:d,0.2
b:d,0.16666666666666666
b:d,0.25
d:b,0.14285714285714285
d:b,0.14285714285714285
b:e,0.2
b:e,0.25
e:b,0.2
a:b,0.3333333333333333
a:b,0.5
b:a,0.3333333333333333
a:b,0.3333333333333333
a:b,0.5
b:a,0.3333333333333333
a:c,0.1
a:c,0.16666666666666666
a:c,0.1111111111111111
a:c,0.1
a:c,0.1
a:c,0.14285714285714285
a:c,0.16666666666666666
a:c,0.2
a:c,0.16666666666666666
a:c,0.25
a:d,0.1
a:d,0.125
a:d,0.25
a:d,0.5
a:e,0.16666666666666666
a:e,0.16666666666666666
c:b,0.1111111111111111
c:b,0.125
b:c,0.1111111111111111
c:b,0.2
c:b,0.25
b:c,0.1111111111111111
c:b,0.14285714285714285
c:b,0.16666666666666666
b:c,0.1111111111111111
c:b,0.125
c:b,0.14285714285714285
b:c,0.1111111111111111
c:b,0.14285714285714285
c:b,0.16666666666666666
b:c,0.1111111111111111
c:a,0.1
c:a,0.14285714285714285
c:a,0.16666666666666666
c:a,0.16666666666666666
c:a,0.1111111111111111
c:a,0.2
c:a,0.1
c:a,0.16666666666666666
c:a,0.1
a:c,0.25
c:d,0.14285714285714285
c:d,0.14285714285714285
c:d,0.16666666666666666
c:d,0.25
c:d,0.2
c:d,0.2
c:d,0.16666666666666666
c:d,0.16666666666666666
d:c,0.14285714285714285
c:d,0.25
c:e,0.16666666666666666
c:e,0.5
c:e,0.2
c:e,0.16666666666666666
c:e,0.16666666666666666
d:b,0.14285714285714285
d:b,0.16666666666666666
b:d,0.14285714285714285
d:b,0.2
d:b,0.25
b:d,0.14285714285714285
d:a,0.1
d:a,0.25
d:a,0.125
a:d,0.5
d:c,0.14285714285714285
d:c,0.16666666666666666
d:c,0.2
d:c,0.16666666666666666
c:d,0.14285714285714285
d:c,0.14285714285714285
d:c,0.25
d:c,0.2
d:c,0.16666666666666666
c:d,0.25
d:e,0.16666666666666666
d:e,0.25
e:b,0.2
e:b,0.25
b:e,0.2
e:a,0.16666666666666666
e:a,0.16666666666666666
e:c,0.16666666666666666
e:c,0.5
e:c,0.2
e:c,0.16666666666666666
e:c,0.16666666666666666
e:d,0.16666666666666666
e:d,0.25

For every pair, such as a and b, only the line with the maximum value should be selected. The order of the pair can differ, as in the example below for the pair a and b:
[a:b,0.3333333333333333
b:a,0.3333333333333333
a:b,0.5
b:a,0.5
a:b,0.3333333333333333
a:b,0.3333333333333333]. Only one line should be output: a:b,0.5

Also, for the following pair a and e, where every line has the same value
[a:e,0.16666666666666666
a:e,0.16666666666666666
e:a,0.16666666666666666
e:a,0.16666666666666666], the output should be only one line: a:e,0.16666666666666666. There should be no duplicates, even if the variables are in reverse order.

rob42 · January 26, 2023, 7:15pm

So that you can learn how to obtain any list item at any given index, here is a demonstration script:

the_list = ['a', 'b', 'c', 'd']

for index, item in enumerate(the_list):
    print(f"index: {index} | item: {item}")

output:
index: 0 | item: a
index: 1 | item: b
index: 2 | item: c
index: 3 | item: d

Now that you can see how a list object is indexed, why not have a go at your problem and post back if you have any questions.

PJh · January 26, 2023, 11:24pm

How can I create logic that keeps only the index:0 item when it encounters a,b#0 at index:0 and b,a#0 at index:1? Similarly, the program should select only one of the two items b,c#0.1 and c,b#0.1. Please note that index positions are not fixed, and the items are an excerpt of a larger dataset of similar items.

rob42 · January 26, 2023, 11:45pm

I’m confused by the entire question: You say “To collect only one value from the list…”, but go on to say “The program should output a,b#0 b,d#0.2, b,c#0.1”. How is that one value?

What are these values?
What does ‘a,b#0’ mean?
Is the list object a collection of 5 string objects, as you have posted?
Are you trying to extract only the numbers of said strings?
What’s the origin of this ‘problem’?

PJh · January 26, 2023, 11:52pm

Yes, the list object is a collection of 5 string objects. I need to filter one out of a,b#0 and b,a#0. The point is that if two variables, like a and b, appear in a different order with the same value (here 0), then only one of the two (either a,b#0 or b,a#0) should be selected for the output list. Actually, I have a text file containing such string objects on separate lines, like:
a,b#0
b,a#0
b,d#0.2
b,c#0.1
c,b#0.1
I need to create another text file containing only:
a,b#0
b,d#0.2
b,c#0.1

rob42 · January 27, 2023, 12:20am

So, you have a file, from which you are trying to extract certain items, based on some criteria.

When you first posted, you said “list”. The word list means something very different to a Python coder, reading a post on a Python Forum, which is the reasoning behind my first post. It is clear now, that you mean list in a literal way.

So, do you know how to read data from a file, into a Python script?

PJh · January 27, 2023, 1:28am

Yes, I am reading the file line by line, but what logic must I apply to get the desired output?

MRAB · January 27, 2023, 1:45am

I’d probably do something like this:

previous_ending = None

with open(input_path) as input_file, open(output_path, 'w') as output_file:
    for line in input_file:
        ending = line.rstrip('\n').partition('#')[2]

        if ending != previous_ending:
            output_file.write(line)
            previous_ending = ending

PJh · January 27, 2023, 2:28am

I am still getting similar output, like:

a,b#0
b,a#0
b,d#0.2
a,b#0
b,c#0.1
c,b#0.1

Still unable to solve the problem

MRAB · January 27, 2023, 2:57am

Ah, I misread it as being consecutive only.

Try this:

previous_endings = set()

with open(input_path) as input_file, open(output_path, 'w') as output_file:
    for line in input_file:
        ending = line.rstrip('\n').partition('#')[2]

        if ending not in previous_endings:
            output_file.write(line)
            previous_endings.add(ending)

PJh · January 27, 2023, 4:16am

I am getting the following output after implementing the code

a:b,0.3333333333333333
a:b,0.5
b:c,0.1111111111111111
b:c,0.2
b:c,0.14285714285714285
b:c,0.125
b:c,0.25
b:c,0.16666666666666666
a:c,0.1

Here I am missing a:d, a:e, and other important lines. Also, repetition still exists between pairs.

rob42 · January 27, 2023, 4:49am

That is a completely different data set to the one with which you started this thread. How can anyone help you when you don’t provide relevant information from the get-go: it’s akin to being asked to navigate a maze while wearing a blindfold.

I see that you’ve done an edit to your first post and as such both the data and the question have been improved.

I have a script that produces the following, based on the data provided. If this is what you’re looking for, then let me know and I’ll post the script.

Output:
a:b,0.5
b:c,0.25
b:d,0.25
b:e,0.25
a:c,0.25
a:d,0.5
a:e,0.16666666666666666
c:d,0.25
c:e,0.5
d:e,0.25

MRAB · January 27, 2023, 6:29pm

Up until now you had numbers separated by ‘,’ and ‘#’, but now they’re separated by ‘:’, and ‘,’. That’s why the last code I posted isn’t working.
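
For the new format, the split would need to change. A minimal sketch, assuming every line looks like pair,value with the pair written as x:y (the helper name parse_line is made up for illustration):

```python
def parse_line(line):
    """Split a line such as 'a:b,0.5' into the pair text and a float value."""
    # rpartition splits on the LAST comma, so the pair text is kept whole.
    pair, _, number = line.rstrip('\n').rpartition(',')
    return pair, float(number)


pair, value = parse_line('a:b,0.5\n')
print(pair, value)
```

With the pair and the number separated, the comparison can be done on the pair rather than on the line ending.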

PJh · January 28, 2023, 1:44am

Yes, Rob, please share the script. This is exactly what I need. Thank you.

rob42 · January 28, 2023, 2:57am

Here you go and you’re welcome.

values = {}
key = 0
val = 1
remove = []

print("Reading data file.")
with open("data") as data:
    for line in data:
        sample = line.strip('\n').split(',')
        if sample[key] in values:
            check = values[sample[key]]
            if float(sample[val]) > check:
                values[sample[key]] = float(sample[val])
        else:
            values[sample[key]] = float(sample[val])

print("Building a list of items to remove...\n")
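for key in values:
    # Build the reversed pair, e.g. 'a:b' -> 'b:a' (works for names of any length).
    new_key = ":".join(reversed(key.split(":")))
    if new_key not in values:
        continue  # no reversed pair in the data, nothing to compare
    sample = values[key]
    check = values[new_key]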
    if sample == check:
        print(f"Found {key} {values[key]} == {new_key} {values[new_key]}")
        print(f"Marking {new_key} {values[new_key]} for removal.")
        values[new_key] = ''
        print()
    elif sample and sample > check:
        print(f"Found {key} {values[key]} > {new_key} {values[new_key]}")
        print(f"Marking {new_key} {values[new_key]} for removal.")
        values[new_key] = ''
        print()

for key in values:
    if not values[key]:
        remove.append(key)

print("Removing items...")
for key in remove:
    values.pop(key)
print("Done.")

print("Writing data to file..")
with open("output", mode='w') as output:
    for key in values:
        data = f"{key},{values[key]}\n"
        output.write(data)

print("Script exit")

PythonCHB · January 28, 2023, 7:35am

Rob’s code should work (I haven’t tested it), but I think you can make it simpler and cleaner. I’m not going to write it for you, but:

You want to build a dict with the keys being the letter pairs (a:b) and the values being the max value.

IIUC, a:b is to be considered the same as b:a – yes? So you want the keys to be “normalized”, which you can do by sorting them:

In [172]: pair
Out[172]: 'b:a'

In [173]: ':'.join(sorted(pair.split(':')))
Out[173]: 'a:b'

If you use that as the keys in your dict, then you’ll get the max value right.

To put the max value in the dict, you can use Rob’s code, though you can clean that up a bit – no need to call float multiple times …

Then no need to remove anything.

OK, I can’t help myself, here’s a nifty 2-liner for building your dict:

old_val = d.setdefault(key, new_val)
d[key] = max(old_val, new_val)
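
Put together with the normalized key, that could look something like the rough sketch below. The names normalize and max_per_pair and the sample lines are picked purely for illustration, and it assumes every non-blank line has the form x:y,number:

```python
def normalize(pair):
    """Return the pair with its two names sorted, so 'b:a' becomes 'a:b'."""
    return ':'.join(sorted(pair.split(':')))


def max_per_pair(lines):
    """Map each normalized pair to the largest value seen for it."""
    d = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pair, _, number = line.rpartition(',')
        key = normalize(pair)
        new_val = float(number)  # float() is called once per line
        # The two-liner: store new_val if the key is new, then keep the max.
        old_val = d.setdefault(key, new_val)
        d[key] = max(old_val, new_val)
    return d


# Made-up sample lines in the same format as the data file.
sample = [
    'a:b,0.3333333333333333',
    'b:a,0.5',
    'a:e,0.16666666666666666',
    'e:a,0.16666666666666666',
]
result = max_per_pair(sample)
for key, value in result.items():
    print(f"{key},{value}")
```

Because the keys are sorted, a pair first seen as b:a is written out as a:b.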

-CHB

rob42 · January 28, 2023, 7:55am

Some good points; I appreciate the feedback and your pointers for code improvements.

As to the removal of ‘dead items’, I did that because I think the provided data is simply a very small sample of a much larger data set and if it’s very large, then it would become resource heavy to retain data that is no longer required. As I’m unsure as to exactly how large the said data set is, I thought it best to err on the side of caution.

As this is nothing more than a coding exercise for me (with the added benefit of being of help to someone else) I’ll re-code this, taking into account the points you cite.

PJh · January 28, 2023, 8:19am

Yes, actually the previous data items need to be retained and compared. Please share the updated Python script. And yes, a:b is to be considered the same as b:a.

PythonCHB · January 29, 2023, 6:24am

Sure – but all you need to keep are the max values – there’s no extra data being kept around if you build the dict as you go, only keeping the keys and the max values.
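
To illustrate the point about memory, here is a sketch that streams a file one line at a time and only ever holds one dict entry per pair, however many lines the file has. The function name write_max_pairs, the file names, and the sample data are assumptions made for the example:

```python
import os
import tempfile


def write_max_pairs(input_path, output_path):
    """Stream input_path and write one 'pair,max' line per unordered pair.

    Returns (number of non-blank lines read, number of pairs kept).
    """
    maxima = {}
    lines_read = 0
    with open(input_path, encoding="utf-8") as infile:
        for line in infile:  # only one line is held in memory at a time
            line = line.strip()
            if not line:
                continue
            lines_read += 1
            pair, _, number = line.rpartition(',')
            key = ':'.join(sorted(pair.split(':')))  # 'b:a' -> 'a:b'
            value = float(number)
            if key not in maxima or value > maxima[key]:
                maxima[key] = value
    with open(output_path, 'w', encoding="utf-8") as outfile:
        for key, value in maxima.items():
            outfile.write(f"{key},{value}\n")
    return lines_read, len(maxima)


# Small demonstration in a temporary directory with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data")
    dst = os.path.join(tmp, "output")
    with open(src, 'w', encoding="utf-8") as f:
        f.write("a:b,0.25\nb:a,0.5\na:b,0.5\nc:b,0.125\nb:c,0.2\n")
    counts = write_max_pairs(src, dst)
    with open(dst, encoding="utf-8") as f:
        written = f.read()

print(counts)
print(written, end="")
```

In the demonstration, five lines go in and the dict ends up with two entries, so nothing has to be removed afterwards.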

rob42 · February 1, 2023, 3:09pm

I’ve been a little busy with more pressing issues (such as making a living), but I think this is a better solution than the one I had before:

values = {}

print("Reading data file.")
with open("data", mode="r", encoding="UTF-8") as data:
    for line in data:
        sample = line.strip('\n').split(',')
        s_key = sample[0]
        s_value = float(sample[1])
        # The same pair in reverse order, e.g. 'a:b' -> 'b:a'.
        r_key = ":".join(reversed(s_key.split(":")))
        if s_key in values:
            # Pair already stored in this order: keep the larger value.
            if s_value > values[s_key]:
                values[s_key] = s_value
        elif r_key in values:
            # Pair already stored in reverse order: update that entry.
            if s_value > values[r_key]:
                values[r_key] = s_value
        else:
            # First time this pair is seen in either order.
            values[s_key] = s_value

print("Writing data to file..")
with open("output", mode='w', encoding="UTF-8") as output:
    for key in values:
        data = f"{key},{values[key]}\n"
        output.write(data)

print("Script exit")

With thanks to @PythonCHB for the feedback.