I have a CSV file that contains information about a path including the
weight of the path. The dataset sample is given below:
start > end > weight
------
A1 > A2 > 0.6
A2 > A5 > 0.5
A3 > A1 > 0.75
A4 > A5 > 0.88
A5 > A3 > 0.99
(+)1(10),4Cadinadiene > Falcarinone > 0.09
Leucodelphinidin > (+)1(10),4Cadinadiene > 0.876
Lignin > (2E,7R,11R)2Phyten1ol > 0.778
(2E,7R,11R)2Phyten1ol > Leucodelphinidin > 0.55
Falcarinone > Lignin > 1
A1 > (+)1(10),4Cadinadiene > 1
A2 > Lignin > 1
A3 > (2E,7R,11R)2Phyten1ol > 1
A4 > Leucodelphinidin > 1
A5 > Falcarinone > 1
Now I want to create another CSV file based on the path concept. For example, from A1 I can visit A2, and from A2 I can visit Lignin (A1 > A2 > Lignin). So we can say that Lignin can be visited from A1. In the same way, we can visit Falcarinone from A1 (A1 > A2 > A5 > Falcarinone).
You'll need a few Python modules to help you. The Python docs are here:
https://docs.python.org/3/
and there's a section for each module mentioned below.
To process this in Python you will need a mapping from "start" to the
various "end"s reachable, and (from lower in your post) the weight.
So you will want a 2-tuple for each endpoint like this, expressing its
name and weight:
("A2", 0.6)
and since you can have multiple "end"s for each "start" you will want a
list of them. For example:
[("A2",0.6), ("(+)1(10),4Cadinadiene",1)]
Then assemble a mapping from "start" to a list for the various ends:
{ "A1": [("A2",0.6), ("(+)1(10),4Cadinadiene",1)],
.......
}
filled out for the various "start"s and "end"s. This mapping lets you
enumerate the places you can go from an arbitrary "start".
I would recommend a defaultdict for that mapping, which is a special
kind of dict which auto-creates missing elements:
# at the top of your script
from collections import defaultdict
# set up the mapping (initially empty)
start_end = defaultdict(list)
That way you're guaranteed that each element is a list, so that you can
append to it.
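As a small illustration of that auto-creation, here is a minimal sketch using a few rows from your sample data typed in by hand (rather than read from the file):

```python
from collections import defaultdict

# Mapping from each "start" to a list of (end, weight) 2-tuples.
start_end = defaultdict(list)

# No need to create the list first: a missing key gets a fresh empty list.
start_end["A1"].append(("A2", 0.6))
start_end["A1"].append(("(+)1(10),4Cadinadiene", 1.0))
start_end["A2"].append(("A5", 0.5))

print(start_end["A1"])
print(dict(start_end))
```

One thing to be aware of: merely looking up a missing key, e.g. start_end["nosuchname"], also creates an empty list for it. Use start_end.get(name, []) if you want to look without creating.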
Since you have a CSV file as input, use the "csv" module to read your
file and fill out the mapping:
# at the top of your script
import csv
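# scan the data file; csv.reader wants an open file, not a filename
with open("yourdatafilenamehere.csv", newline="") as f:
    for start, end, weight in csv.reader(f):
        # append the value (end, weight) to the entry in start_end for start
        start_end[start].append((end, float(weight)))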
Print out start, end, weight as you read them to be sure of what's going
on.
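Putting the reading step together, a rough sketch might look like the following. I am assuming here that your file is tab-separated and has a header row; I can't tell that from your post, so adjust the delimiter to suit. The load_mapping name is just something I made up for illustration, and the io.StringIO object stands in for your real file.

```python
import csv
import io
from collections import defaultdict

def load_mapping(lines, delimiter="\t"):
    """Build {start: [(end, weight), ...]} from an iterable of text lines."""
    start_end = defaultdict(list)
    reader = csv.reader(lines, delimiter=delimiter)
    next(reader, None)  # assumption: the first row is the start/end/weight header
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or decorative lines
        start, end, weight = row[:3]
        start_end[start].append((end, float(weight)))
    return start_end

# Stand-in for the real file; with a file on disk you would pass
# open("yourdatafilenamehere.csv", newline="") instead.
sample = io.StringIO(
    "start\tend\tweight\n"
    "A1\tA2\t0.6\n"
    "A2\tA5\t0.5\n"
    "A1\t(+)1(10),4Cadinadiene\t1\n"
)
start_end = load_mapping(sample)
for start, ends in start_end.items():
    print(start, ends)
```

Note the float(weight): the csv module hands you strings, and you will want numbers when you come to add the weights up.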
However, I am not interested in including data from (A1 > A2 > A5 > A3 > (2E,7R,11R)2Phyten1ol). That means I want to take A at least 2 times and at most 3 times.
Here is where things get tricky. You want a recursive function to walk
your mapping from a starting name, e.g. "A1", and return every reachable
"end" at least 2 steps away and no more than 3 steps away, and the sum
of the weights in each step taken.
I would write a function accepting these things:
- the mapping
- the original starting name
- the "current" start name, representing where you are right now in
  the graph
- total weight so far
- total steps so far
- minimum steps
- maximum steps
It would be a recursive function, which calls itself to continue from
each "start":
def step(start_end, start0, start, weight_so_far, steps_so_far, min_steps, max_steps):
....
Think about what the function must do:
- find all the ends from "start", which is just start_end[start]
- loop over those (end, weight) pairs
- for each end, if you've reached min_steps, print it out along with
  start0
- if you haven't reached max_steps, call the function again with updated
  values:
  step(start_end, start0, end, weight_so_far+weight, steps_so_far+1, min_steps, max_steps)
Put in lots of print() calls, they will help you see what is happening.
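To make the outline above concrete, here is one possible sketch, not the only way to do it. It keeps the same parameters but is written as a generator which yields (start0, end, total weight) rows instead of printing them; that variation is my own choice so the rows can be collected later. The little start_end dict is an excerpt of your sample data, typed in by hand.

```python
def step(start_end, start0, start, weight_so_far, steps_so_far, min_steps, max_steps):
    """Yield (start0, end, total_weight) for ends min_steps..max_steps away."""
    # .get() avoids adding empty entries if start_end is a defaultdict
    for end, weight in start_end.get(start, []):
        steps = steps_so_far + 1
        total = weight_so_far + weight
        if steps >= min_steps:
            yield start0, end, total
        if steps < max_steps:
            # continue the walk from this end with the updated totals
            yield from step(start_end, start0, end, total, steps, min_steps, max_steps)

# A small excerpt of the sample data
start_end = {
    "A1": [("A2", 0.6)],
    "A2": [("A5", 0.5), ("Lignin", 1.0)],
    "A5": [("Falcarinone", 1.0)],
}
rows = list(step(start_end, "A1", "A1", 0.0, 0, 2, 3))
for row in rows:
    print(row)
```

Because max_steps bounds the depth, the recursion stops even though your full data contains loops (A1 > A2 > A5 > A3 > A1). Be aware that the same end can turn up more than once if it is reachable by different routes, and that this reports every end 2 or 3 steps away, including the A names; you may want to filter those.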
My expected 2nd CSV file will contain only this kind of information:
start>end>Total weight 
A1 
A1 
You can write this with the "csv" module again; make a "csv.writer" and
give it each output row; that row will be the values at the "print it
out with start0" step above.
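A minimal sketch of the writing side, assuming you have already collected your output rows in a list. The two rows here are only illustrative: they are the weights from your sample added up along the two routes you describe (A1 > A2 > Lignin and A1 > A2 > A5 > Falcarinone). The io.StringIO object stands in for a real output file, and "output.csv" is a made-up name.

```python
import csv
import io

# Illustrative rows: (start, end, total weight)
rows = [("A1", "Lignin", 1.6), ("A1", "Falcarinone", 2.1)]

# Stand-in for a real file; on disk you would use something like
# open("output.csv", "w", newline="") instead.
out = io.StringIO()
csvw = csv.writer(out)
csvw.writerow(["start", "end", "Total weight"])  # header row
csvw.writerows(rows)                             # one line per row

print(out.getvalue())
```

The csv.writer takes care of quoting for you, which matters here because some of your names contain commas.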
I'd do this in steps:

- load the original CSV data into your mapping and print that out to
  make sure it is correct.

- after that is working, write the recursive function to walk the
  mapping like a graph. Put in print() calls to see what's going on.

- after the walking is working, add the csv.writer stuff to write the
  data to a file instead of just printing it out; that file can be your
  standard output, BTW, so that it lands on your screen:
csvw = csv.writer(sys.stdout)  # needs "import sys" at the top of your script
The pprint module has a handy function for printing complex things:
# at the top of your script
from pprint import pprint
# when you want to print, for example the start_end mapping
pprint(start_end)
It is easier on the eyes than print(start_end) or print(repr(start_end))
which are your more basic choices.
Cheers,
Cameron Simpson cs@cskk.id.au