How to separate data using path concept from a CSV file to another CSV file

akib62 · November 22, 2020, 12:22pm

Hello Good People,

I have a CSV file that contains information about a path including the weight of the path. The dataset sample is given below:

start	end	weight
A1	A2	0.6
A2	A5	0.5
A3	A1	0.75
A4	A5	0.88
A5	A3	0.99
(+)-1(10),4-Cadinadiene	Falcarinone	0.09
Leucodelphinidin	(+)-1(10),4-Cadinadiene	0.876
Lignin	(2E,7R,11R)-2-Phyten-1-ol	0.778
(2E,7R,11R)-2-Phyten-1-ol	Leucodelphinidin	0.55
Falcarinone	Lignin	1
A1	(+)-1(10),4-Cadinadiene	1
A2	Lignin	1
A3	(2E,7R,11R)-2-Phyten-1-ol	1
A4	Leucodelphinidin	1
A5	Falcarinone	1

Now I want to create another CSV file based on the path concept. For example, from A1 I can visit A2, and from A2 I can visit Lignin (A1 -> A2 -> Lignin), so Lignin is reachable from A1. In the same way, Falcarinone is reachable from A1 (A1 -> A2 -> A5 -> Falcarinone).

However, I do not want to include paths such as (A1 -> A2 -> A5 -> A3 -> (2E,7R,11R)-2-Phyten-1-ol). That means a path should pass through an A node at least 2 times and at most 3 times.

My expected second CSV file will contain only this kind of information:

start	end	Total weight
A1	Lignin	1.6
A1	Falcarinone	2.1

I have no idea how to do this task, and I am extremely sorry that I do not have any reproducible code.

I will be grateful if you give any suggestions or ideas.

cameron · November 22, 2020, 11:29pm

I have a CSV file that contains information about a path including the
weight of the path. The dataset sample is given below:

start	end	weight
A1	A2	0.6
A2	A5	0.5
A3	A1	0.75
A4	A5	0.88
A5	A3	0.99
(+)-1(10),4-Cadinadiene	Falcarinone	0.09
Leucodelphinidin	(+)-1(10),4-Cadinadiene	0.876
Lignin	(2E,7R,11R)-2-Phyten-1-ol	0.778
(2E,7R,11R)-2-Phyten-1-ol	Leucodelphinidin	0.55
Falcarinone	Lignin	1
A1	(+)-1(10),4-Cadinadiene	1
A2	Lignin	1
A3	(2E,7R,11R)-2-Phyten-1-ol	1
A4	Leucodelphinidin	1
A5	Falcarinone	1

Now I want to create another CSV file based on the path concept. For example, from A1 I can visit A2, and from A2 I can visit Lignin (A1 -> A2 -> Lignin), so Lignin is reachable from A1. In the same way, Falcarinone is reachable from A1 (A1 -> A2 -> A5 -> Falcarinone).

You’ll need a few Python modules to help you. The Python docs are here:

https://docs.python.org/3/

and there’s a section for each module mentioned below.

To process this in Python you will need a mapping from “start” to the
various "end"s reachable, and (from lower in your post) the weight.

So you will want a 2-tuple for each endpoint like this, expressing its
name and weight:

("A2", 0.6)

and since you can have multiple "end"s for each “start” you will want a
list of them. For example:

[("A2",0.6), ("(+)-1(10),4-Cadinadiene",1)]

Then assemble a mapping from “start” to a list for the various ends:

{ "A1": [("A2",0.6), ("(+)-1(10),4-Cadinadiene",1)],
.......
}

filled out for the various "start"s and "end"s. This mapping lets you
enumerate the places you can go from an arbitrary “start”.

I would recommend a defaultdict for that mapping, which is a special
kind of dict which autocreates missing elements:

# at the top of your script
from collections import defaultdict

# set up the mapping (initially empty)
start_end = defaultdict(list)

That way you’re guaranteed that each element is a list, so that you can
append to it.
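As a small illustration of the mapping described above, here is a sketch that fills a defaultdict from three rows copied from the sample data. The helper name build_mapping is made up for this example, and converting the weight to a float is my assumption about how you will want to use it later.

```python
from collections import defaultdict

def build_mapping(rows):
    """Group (start, end, weight) rows into {start: [(end, weight), ...]}."""
    start_end = defaultdict(list)
    for start, end, weight in rows:
        # a missing key is created as an empty list on first access
        start_end[start].append((end, float(weight)))
    return start_end

# three rows copied from the sample data in the question
sample_rows = [
    ("A1", "A2", "0.6"),
    ("A1", "(+)-1(10),4-Cadinadiene", "1"),
    ("A2", "A5", "0.5"),
]
mapping = build_mapping(sample_rows)
print(mapping["A1"])
```

Because every entry starts life as a list, the append never fails, even the first time a "start" is seen.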

Since you have a CSV file as input, use the “csv” module to read your
file and fill out the mapping:

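# at the top of your script
import csv

# scan the data file; csv.reader needs an open file, not a filename
with open("your-datafilename-here.csv", newline="") as f:
    rows = csv.reader(f)
    next(rows)  # skip the header row (start, end, weight)
    for start, end, weight in rows:
        # append the value (end, weight) to the entry in start_end for start
        start_end[start].append((end, float(weight)))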

Print out start, end, weight as you read them to be sure of what’s going
on.
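One thing to check when reading: the sample in the question looks tab-separated, and some names contain commas, so the default comma delimiter may split them wrongly. Below is a sketch that assumes a tab-separated file with one header row. The function name load_edges is hypothetical, and io.StringIO stands in for the real file so the example runs on its own.

```python
import csv
import io
from collections import defaultdict

def load_edges(fileobj, delimiter="\t"):
    """Read start/end/weight rows from an open file into a mapping."""
    start_end = defaultdict(list)
    reader = csv.reader(fileobj, delimiter=delimiter)
    next(reader)  # skip the header row (start, end, weight)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        start, end, weight = row
        start_end[start].append((end, float(weight)))
    return start_end

# io.StringIO stands in for open("your-datafilename-here.csv", newline="")
sample = io.StringIO(
    "start\tend\tweight\n"
    "A1\tA2\t0.6\n"
    "A1\t(+)-1(10),4-Cadinadiene\t1\n"
    "A2\tLignin\t1\n"
)
edges = load_edges(sample)
print(dict(edges))
```

If your file really is comma-separated with quoted names, passing delimiter="," should be all that changes.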

However, I do not want to include paths such as (A1 -> A2 -> A5 -> A3 -> (2E,7R,11R)-2-Phyten-1-ol). That means a path should pass through an A node at
least 2 times and at most 3 times.

Here is where things get tricky. You want a recursive function to walk
your mapping from a starting name eg “A1” and return every reachable
“end” at least 2 steps away and no more than 3 steps away, and the sum
of the weights in each step taken.

I would write a function accepting these things:

- the mapping
- the original starting name
- the “current” start name, representing where you are right now in the graph
- total weight so far
- total steps so far
- minimum steps
- maximum steps

It would be a recursive function, which calls itself to continue from
each “start”:

def step(start_end, start0, start, weight_so_far, steps_so_far, min_steps, max_steps):
    ....

Think about what the function must do:

- find all the ends from “start”, which is just start_end[start]
- loop over those (end,weight) pairs
- for each end, if you’ve reached min_steps, print it out along with start0
- if you haven’t reached max_steps, call the function again with updated values:
  step(start_end, start0, end, weight_so_far+weight, steps_so_far+1, min_steps, max_steps)

Put in lots of print() calls, they will help you see what is happening.
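To make the outline above concrete, here is one possible sketch of the recursive walk. It is a generator that yields each result instead of printing it, which is my own choice so the rows can be reused later. The small graph dictionary is a hand-copied subset of the sample data, and reading "at least 2, at most 3" as a count of steps is an assumption about what the question means.

```python
def step(start_end, start0, start, weight_so_far, steps_so_far, min_steps, max_steps):
    """Yield (start0, end, total_weight) for ends reached in min_steps..max_steps steps."""
    # .get() avoids creating empty entries when start has no outgoing edges
    for end, weight in start_end.get(start, []):
        total = weight_so_far + weight
        steps = steps_so_far + 1
        if steps >= min_steps:
            yield (start0, end, total)
        if steps < max_steps:
            # continue the walk from this end, carrying the running totals
            yield from step(start_end, start0, end, total, steps, min_steps, max_steps)

# a hand-copied subset of the sample data, for illustration only
graph = {
    "A1": [("A2", 0.6)],
    "A2": [("A5", 0.5), ("Lignin", 1.0)],
    "A5": [("Falcarinone", 1.0)],
}

results = list(step(graph, "A1", "A1", 0.0, 0, 2, 3))
for start0, end, total in results:
    print(start0, end, f"{total:.2f}")
```

On this subset the walk reports Lignin and Falcarinone from A1, matching the expected totals of 1.6 and 2.1, but it also reports A5, because a pure step count does not distinguish the "A" names from the compound names. If only the compound ends are wanted, an extra filter on the end name would be needed; the exact rule is something only the original poster can confirm.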

My expected second CSV file will contain only this kind of information:

start	end	Total weight
A1	Lignin	1.6
A1	Falcarinone	2.1

You can write this with the “csv” module again; make a “csv.writer” and
give it each output row - that row will be the values at the “print it
out with start0” step above.

I’d do this in steps:

1. load the original CSV data into your mapping and print that out to make sure it is correct.
2. after that is working, write the recursive function to walk the mapping like a graph. Put in print() calls to see what’s going on.
3. after the walking is working, add the csv.writer stuff to write the data to a file instead of just printing it out; that file can be your standard output, BTW, so that it lands on your screen:

import sys
csvw = csv.writer(sys.stdout)
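For the writing step, a minimal sketch might look like the following. The helper name write_rows is made up, the rows are simply the two lines from the expected output in the question, and the io.StringIO buffer stands in for a real output file.

```python
import csv
import io
import sys

def write_rows(rows, fileobj):
    """Write a header plus (start, end, total_weight) rows as CSV."""
    csvw = csv.writer(fileobj)
    csvw.writerow(["start", "end", "Total weight"])
    for start, end, total in rows:
        csvw.writerow([start, end, total])

# rows as they might come out of the walking step
rows = [("A1", "Lignin", 1.6), ("A1", "Falcarinone", 2.1)]

write_rows(rows, sys.stdout)  # shows the CSV on screen

buffer = io.StringIO()  # stands in for open("output.csv", "w", newline="")
write_rows(rows, buffer)
```

Keeping the writer behind a function that takes any file object makes it easy to switch between the screen and a real file once the output looks right.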

The pprint module has a handy function for printing complex things:

# at the top of your script
from pprint import pprint

# when you want to print, for example the start_end mapping
pprint(start_end)

It is easier on the eyes than print(start_end) or print(repr(start_end))
which are your more basic choices.

Cheers,
Cameron Simpson cs@cskk.id.au

akib62 · November 23, 2020, 9:31am

@cameron thank you very much for your perfect guidelines.

I was trying something similar, but I will follow your guidelines.

cameron · November 23, 2020, 9:51pm

The question is complicated enough that if you’re genuinely new to
Python it is pretty hard (because you need several things, all of which
may be new to you).

If you get stuck, come back with your code and of course the failing
output, etc etc.

Cheers,
Cameron Simpson cs@cskk.id.au