Handling elements in a file

My data is a text file, as shown below:

a:b,0.3333333333333333
b:a,0.3333333333333333
a:b,0.5
b:a,0.5
a:b,0.3333333333333333
a:b,0.3333333333333333
b:c,0.1111111111111111
b:c,0.2
b:c,0.14285714285714285
b:c,0.125
b:c,0.14285714285714285
b:c,0.125
b:c,0.25
b:c,0.16666666666666666
b:c,0.14285714285714285
b:c,0.16666666666666666
c:b,0.1111111111111111
c:b,0.1111111111111111
c:b,0.1111111111111111
c:b,0.1111111111111111
c:b,0.1111111111111111
b:d,0.14285714285714285
b:d,0.2
b:d,0.16666666666666666
b:d,0.25
d:b,0.14285714285714285
d:b,0.14285714285714285
b:e,0.2
b:e,0.25
e:b,0.2
a:b,0.3333333333333333
a:b,0.5
b:a,0.3333333333333333
a:b,0.3333333333333333
a:b,0.5
b:a,0.3333333333333333
a:c,0.1
a:c,0.16666666666666666
a:c,0.1111111111111111
a:c,0.1
a:c,0.1
a:c,0.14285714285714285
a:c,0.16666666666666666
a:c,0.2
a:c,0.16666666666666666
a:c,0.25
a:d,0.1
a:d,0.125
a:d,0.25
a:d,0.5
a:e,0.16666666666666666
a:e,0.16666666666666666
c:b,0.1111111111111111
c:b,0.125
b:c,0.1111111111111111
c:b,0.2
c:b,0.25
b:c,0.1111111111111111
c:b,0.14285714285714285
c:b,0.16666666666666666
b:c,0.1111111111111111
c:b,0.125
c:b,0.14285714285714285
b:c,0.1111111111111111
c:b,0.14285714285714285
c:b,0.16666666666666666
b:c,0.1111111111111111
c:a,0.1
c:a,0.14285714285714285
c:a,0.16666666666666666
c:a,0.16666666666666666
c:a,0.1111111111111111
c:a,0.2
c:a,0.1
c:a,0.16666666666666666
c:a,0.1
a:c,0.25
c:d,0.14285714285714285
c:d,0.14285714285714285
c:d,0.16666666666666666
c:d,0.25
c:d,0.2
c:d,0.2
c:d,0.16666666666666666
c:d,0.16666666666666666
d:c,0.14285714285714285
c:d,0.25
c:e,0.16666666666666666
c:e,0.5
c:e,0.2
c:e,0.16666666666666666
c:e,0.16666666666666666
d:b,0.14285714285714285
d:b,0.16666666666666666
b:d,0.14285714285714285
d:b,0.2
d:b,0.25
b:d,0.14285714285714285
d:a,0.1
d:a,0.25
d:a,0.125
a:d,0.5
d:c,0.14285714285714285
d:c,0.16666666666666666
d:c,0.2
d:c,0.16666666666666666
c:d,0.14285714285714285
d:c,0.14285714285714285
d:c,0.25
d:c,0.2
d:c,0.16666666666666666
c:d,0.25
d:e,0.16666666666666666
d:e,0.25
e:b,0.2
e:b,0.25
b:e,0.2
e:a,0.16666666666666666
e:a,0.16666666666666666
e:c,0.16666666666666666
e:c,0.5
e:c,0.2
e:c,0.16666666666666666
e:c,0.16666666666666666
e:d,0.16666666666666666
e:d,0.25

For every pair, such as a and b, only the line with the maximum value should be selected. The order within a pair can differ, as in the example below for the pair a and b:
[a:b,0.3333333333333333
b:a,0.3333333333333333
a:b,0.5
b:a,0.5
a:b,0.3333333333333333
a:b,0.3333333333333333], only one line should be output: a:b,0.5

Likewise, for the pair a and e below, where every line has the same value
[a:e,0.16666666666666666
a:e,0.16666666666666666
e:a,0.16666666666666666
e:a,0.16666666666666666], the output should be a single line: a:e,0.16666666666666666. There should be no duplicates, even when the variables appear in reverse order.

So that you can see how to get a list item at any given index, here is a short demonstration script:

the_list = ['a', 'b', 'c', 'd']

for index, item in enumerate(the_list):
    print(f"index: {index} | item: {item}")
output:
index: 0 | item: a
index: 1 | item: b
index: 2 | item: c
index: 3 | item: d

Now that you can see how a list object is indexed, why not have a go at your problem and post back if you have any questions.

The problem still remains:
the_list = ['a,b#0', 'b,a#0', 'b,d#0.2', 'b,c#0.1', 'c,b#0.1']
output:
index: 0 | item: a,b#0
index: 1 | item: b,a#0
index: 2 | item: b,d#0.2
index: 3 | item: b,c#0.1
index: 4 | item: c,b#0.1

How can I write logic that keeps only the item at index 0 when it encounters a,b#0 at index 0 and b,a#0 at index 1? Similarly, the program should select only one of the two items b,c#0.1 and c,b#0.1. Please note that the index positions are not fixed, and the items are an excerpt of a larger dataset of similar items.

I’m confused by the entire question: You say “To collect only one value from the list…”, but go on to say “The program should output a,b#0 b,d#0.2, b,c#0.1”. How is that one value?

What are these values?
What does ‘a,b#0’ mean?
Is the list object a collection of 5 string objects, as you have posted?
Are you trying to extract only the numbers of said strings?
What’s the origin of this ‘problem’?

Yes, the list object is a collection of 5 string objects. I need to filter out one of a,b#0 and b,a#0. The purpose is this: if two variables, like a and b, appear in different orders with the same value (here 0), then only one of the two (either a,b#0 or b,a#0) should be selected for the output list. Actually, I have a text file containing such string objects on separate lines, like:
a,b#0
b,a#0
b,d#0.2
b,c#0.1
c,b#0.1
I need to create another text file containing only:
a,b#0
b,d#0.2
b,c#0.1

So, you have a file, from which you are trying to extract certain items, based on some criteria.

When you first posted, you said “list”. The word list means something very different to a Python coder, reading a post on a Python Forum, which is the reasoning behind my first post. It is clear now, that you mean list in a literal way.

So, do you know how to read data from a file, into a Python script?

Yes, I am reading the file line by line, but what logic must I apply to get the desired output?

I’d probably do something like this:

previous_ending = None

with open(input_path) as input_file, open(output_path, 'w') as output_file:
    for line in input_file:
        ending = line.rstrip('\n').partition('#')[2]

        if ending != previous_ending:
            output_file.write(line)
            previous_ending = ending

I am still getting similar output like

a,b#0
b,a#0
b,d#0.2
a,b#0
b,c#0.1
c,b#0.1

Still unable to solve the problem

Ah, I misread it as being consecutive only.

Try this:

previous_endings = set()

with open(input_path) as input_file, open(output_path, 'w') as output_file:
    for line in input_file:
        ending = line.rstrip('\n').partition('#')[2]

        if ending not in previous_endings:
            output_file.write(line)
            previous_endings.add(ending)
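
Note that the snippet above keys on the text after the '#' alone, so two unrelated pairs that happen to share a value would collapse into one line. A minimal sketch that instead keys on the unordered pair plus the value, assuming every line has the form x,y#value (the helper name dedupe_pairs is made up for illustration):

```python
def dedupe_pairs(lines):
    """Keep the first line seen for each (unordered pair, value) combination."""
    seen = set()
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pair, _, value = line.partition('#')
        # frozenset ignores order, so 'a,b' and 'b,a' produce the same key.
        key = (frozenset(pair.split(',')), value)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


sample = ['a,b#0', 'b,a#0', 'b,d#0.2', 'b,c#0.1', 'c,b#0.1']
print(dedupe_pairs(sample))
```

For the five sample strings this keeps a,b#0, b,d#0.2 and b,c#0.1, in that order.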

I am getting the following output after implementing the code

a:b,0.3333333333333333
a:b,0.5
b:c,0.1111111111111111
b:c,0.2
b:c,0.14285714285714285
b:c,0.125
b:c,0.25
b:c,0.16666666666666666
a:c,0.1

Here I am missing a:d and a:e, and other important lines. Also, repetition exists between pairs.

That is a completely different data set to the one with which you started this thread. How can anyone help you when you don’t provide relevant information from the get-go: it’s akin to being asked to navigate a maze while wearing a blindfold.


I see that you’ve done an edit to your first post and as such both the data and the question have been improved.

I have a script that produces the following, based on the data provided. If this is what you’re looking for, then let me know and I’ll post the script.

Output:
a:b,0.5
b:c,0.25
b:d,0.25
b:e,0.25
a:c,0.25
a:d,0.5
a:e,0.16666666666666666
c:d,0.25
c:e,0.5
d:e,0.25

Up until now you had numbers separated by ‘,’ and ‘#’, but now they’re separated by ‘:’, and ‘,’. That’s why the last code I posted isn’t working.
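
To see why, here is a small sketch of what partition('#') returns when a line contains no '#' at all. Every such line yields the same empty "ending", so if the code is run unchanged only the first line passes the check. The two sample strings are taken from the two data formats in this thread; the variable names are just for illustration.

```python
old_style = 'a,b#0'
new_style = 'a:b,0.5'

# With a '#' present, the third element is the text after it.
print(old_style.partition('#'))   # ('a,b', '#', '0')

# Without a '#', partition returns the whole string and two empty strings.
print(new_style.partition('#'))   # ('a:b,0.5', '', '')

# Splitting on the last ',' recovers the value in the new format.
pair, _, value = new_style.rpartition(',')
print(pair, value)                # a:b 0.5
```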

Yes, Rob, please share the script. This is exactly what I need. Thank you.

Here you go and you’re welcome.

values = {}
key = 0
val = 1
remove = []

print("Reading data file.")
with open("data") as data:
    for line in data:
        sample = line.strip('\n').split(',')
        if sample[key] in values:
            check = values[sample[key]]
            if float(sample[val]) > check:
                values[sample[key]] = float(sample[val])
        else:
            values[sample[key]] = float(sample[val])

print("Building a list of items to remove...\n")
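for key in values:
    # Build the reversed pair, e.g. "a:b" -> "b:a" (also works for longer names).
    new_key = ":".join(reversed(key.split(":")))
    if new_key not in values:
        continue  # no reversed entry exists, so there is nothing to compare
    sample = values[key]
    check = values[new_key]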
    if sample == check:
        print(f"Found {key} {values[key]} == {new_key} {values[new_key]}")
        print(f"Marking {new_key} {values[new_key]} for removal.")
        values[new_key] = ''
        print()
    elif sample and sample > check:
        print(f"Found {key} {values[key]} > {new_key} {values[new_key]}")
        print(f"Marking {new_key} {values[new_key]} for removal.")
        values[new_key] = ''
        print()

for key in values:
    if not values[key]:
        remove.append(key)

print("Removing items...")
for key in remove:
    values.pop(key)
print("Done.")

print("Writing data to file..")
with open("output", mode='w') as output:
    for key in values:
        data = f"{key},{values[key]}\n"
        output.write(data)

print("Script exit")

Rob’s code should work (I haven’t tested it), but I think you can make it simpler and cleaner. I’m not going to write it for you, but:

You want to build a dict with the keys being the letter pairs (a:b) and the values being the max value.

IIUC, a:b is to be considered the same as b:a, yes? So you want the keys to be “normalized”, which you can do by sorting them:

In [172]: pair
Out[172]: 'b:a'

In [173]: ':'.join(sorted(pair.split(':')))
Out[173]: 'a:b'

If you use that as the keys in your dict, then you’ll get the max value right.

To put the max value in the dict, you can use Rob’s code, though you can clean that up a bit; there is no need to call float multiple times.

Then there is no need to remove anything.

OK, I can’t help myself, here’s a nifty 2-liner for building your dict:

old_val = d.setdefault(key, new_val)
d[key] = max(old_val, new_val)
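
Putting those pieces together, a rough sketch might look like the following. It assumes each line has the form x:y,value as in the first post; the function name max_per_pair and the short sample list are only for illustration.

```python
def max_per_pair(lines):
    """Map each normalized pair (e.g. 'a:b') to the largest value seen for it."""
    d = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        pair, _, text = line.rpartition(',')
        # Sorting the two names makes 'b:a' and 'a:b' share one key.
        key = ':'.join(sorted(pair.split(':')))
        new_val = float(text)  # convert once per line
        old_val = d.setdefault(key, new_val)
        d[key] = max(old_val, new_val)
    return d


sample = ['a:b,0.3333333333333333', 'b:a,0.5', 'a:e,0.16666666666666666',
          'e:a,0.16666666666666666', 'b:c,0.2', 'c:b,0.25']
for key, value in max_per_pair(sample).items():
    print(f'{key},{value}')
```

One consequence of this choice: the output always uses the sorted form of the pair, so a pair that only ever appears as e:a in the file would be written as a:e.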

-CHB


Some good points; I appreciate the feedback and your pointers for code improvements.

As to the removal of ‘dead items’, I did that because I think the provided data is simply a very small sample of a much larger data set and if it’s very large, then it would become resource heavy to retain data that is no longer required. As I’m unsure as to exactly how large the said data set is, I thought it best to err on the side of caution.

As this is nothing more than a coding exercise for me (with the added benefit of being of help to someone else) I’ll re-code this, taking into account the points you cite.


Yes, the previous data items do need to be retained and compared. Please share the updated Python script. And yes, a:b is to be considered the same as b:a.

Sure – but all you need to keep are the max values – there’s no extra data being kept around if you build the dict as you go, only keeping the keys and the max values.


I’ve been a little busy with more pressing issues (such as making a living), but I think this is a better solution than I had before:

values = {}

print("Reading data file.")
with open("data", mode="r", encoding="UTF-8") as data:
    for line in data:
        sample = line.strip('\n').split(',')
        S_KEY = sample[0]
        S_VALUE = float(sample[1])
        # Reversed form of the pair, e.g. "b:a" for "a:b".
        R_KEY = ":".join(reversed(S_KEY.split(":")))
        if S_KEY in values:
            # Pair already stored in this order: keep the larger value.
            if S_VALUE > values[S_KEY]:
                values[S_KEY] = S_VALUE
        elif R_KEY in values:
            # Pair already stored in reverse order: update that entry.
            if S_VALUE > values[R_KEY]:
                values[R_KEY] = S_VALUE
        else:
            # First time this pair is seen in either order.
            values[S_KEY] = S_VALUE

print("Writing data to file..")
with open("output", mode='w', encoding="UTF-8") as output:
    for key in values:
        data = f"{key},{values[key]}\n"
        output.write(data)

print("Script exit")

With thanks to @PythonCHB for the feedback.
