I have a CSV file that contains information about a path including the
weight of the path. The dataset sample is given below:
start|end|weight|
---|---|---|
A1|A2|0.6|
A2|A5|0.5|
A3|A1|0.75|
A4|A5|0.88|
A5|A3|0.99|
(+)-1(10),4-Cadinadiene|Falcarinone|0.09|
Leucodelphinidin|(+)-1(10),4-Cadinadiene|0.876|
Lignin|(2E,7R,11R)-2-Phyten-1-ol|0.778|
(2E,7R,11R)-2-Phyten-1-ol|Leucodelphinidin|0.55|
Falcarinone|Lignin|1|
A1|(+)-1(10),4-Cadinadiene|1|
A2|Lignin|1|
A3|(2E,7R,11R)-2-Phyten-1-ol|1|
A4|Leucodelphinidin|1|
A5|Falcarinone|1|
Now I want to create another CSV file based on the path concept. For example, from A1 I can visit A2, and from A2 I can visit Lignin (A1 -> A2 -> Lignin).
So we can say that we can visit Lignin from A1. In the same way, we can visit Falcarinone from A1 (A1 -> A2 -> A5 -> Falcarinone).
You’ll need a few Python modules to help you. The Python docs are here:
https://docs.python.org/3/
and there’s a section for each module mentioned below.
To process this in Python you will need a mapping from “start” to the
various "end"s reachable, and (from lower in your post) the weight.
So you will want a 2-tuple for each endpoint like this, expressing its
name and weight:
("A2", 0.6)
and since you can have multiple "end"s for each “start” you will want a
list of them. For example:
[("A2",0.6), ("(+)-1(10),4-Cadinadiene",1)]
Then assemble a mapping from “start” to a list for the various ends:
{ "A1": [("A2",0.6), ("(+)-1(10),4-Cadinadiene",1)],
.......
}
filled out for the various "start"s and "end"s. This mapping lets you
enumerate the places you can go from an arbitrary “start”.
I would recommend a defaultdict for that mapping, which is a special
kind of dict which autocreates missing elements:
# at the top of your script
from collections import defaultdict
# set up the mapping (initially empty)
start_end = defaultdict(list)
That way you’re guaranteed that each element is a list, so that you can
append to it.
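As a small illustration of that autocreation (a sketch only, using a few rows from your sample data), appending to a key which isn't there yet just works:

```python
from collections import defaultdict

start_end = defaultdict(list)

# Appending to a key that does not exist yet creates an empty list first.
start_end["A1"].append(("A2", 0.6))
start_end["A1"].append(("(+)-1(10),4-Cadinadiene", 1.0))
start_end["A2"].append(("A5", 0.5))

# dict() is only used here to get a plainer printout.
print(dict(start_end))
```

One thing to be aware of: merely looking up a missing key, such as start_end["nowhere"], also creates an empty list for it. Use start_end.get("nowhere", []) if you want to look without creating.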
Since you have a CSV file as input, use the “csv” module to read your
file and fill out the mapping:
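# at the top of your script
import csv
# scan the data file; csv.reader wants an open file, not a filename
with open("your-datafilename-here.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row (drop this line if there is none)
    for start, end, weight in reader:
        # append the value (end, weight) to the entry in start_end for start
        start_end[start].append((end, float(weight)))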
Print out start, end, weight as you read them to be sure of what’s going
on.
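Here is a rough sketch of the loading step which you can run without a data file. The load_edges name, the header row and the comma delimiter are my assumptions, not something from your post; io.StringIO stands in for the open file. Note that names containing commas, like (+)-1(10),4-Cadinadiene, need to be quoted in a comma separated file.

```python
import csv
import io
from collections import defaultdict


def load_edges(lines, delimiter=","):
    """Build {start: [(end, weight), ...]} from an iterable of CSV lines."""
    start_end = defaultdict(list)
    reader = csv.reader(lines, delimiter=delimiter)
    next(reader, None)  # assumption: the first row is a header
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or short rows
        start, end, weight = row[:3]
        start_end[start.strip()].append((end.strip(), float(weight)))
    return start_end


# Stand-in for an open file, holding a few rows of the sample data.
sample = io.StringIO(
    'start,end,weight\n'
    'A1,A2,0.6\n'
    'A2,A5,0.5\n'
    'A1,"(+)-1(10),4-Cadinadiene",1\n'
    'A2,Lignin,1\n'
)
start_end = load_edges(sample)
for start, ends in start_end.items():
    print(start, ends)
```

If your real file uses some other separator, pass it as the delimiter, for example load_edges(f, delimiter="|").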
However, I am not interested in including data from (A1 -> A2 -> A5 -> A3 -> (2E,7R,11R)-2-Phyten-1-ol).
That means I want to take A at least 2 times and at most 3 times.
Here is where things get tricky. You want a recursive function to walk
your mapping from a starting name eg “A1” and return every reachable
“end” at least 2 steps away and no more than 3 steps away, and the sum
of the weights in each step taken.
I would write a function accepting these things:
- the mapping
- the original starting name
- the “current” start name, representing where you are right now in
the graph
- total weight so far
- total steps so far
- minimum steps
- maximum steps
It would be a recursive function, which calls itself to continue from
each “start”:
def step(start_end, start0, start, weight_so_far, steps_so_far, min_steps, max_steps):
....
Think about what the function must do:
- find all the ends from “start”, which is just start_end[start]
- loop over those (end,weight) pairs
- for each end, if you’ve reached min_steps, print it out along with
start0
- if you haven’t reached max_steps, call the function again with updated
values:
step(start_end, start0, end, weight_so_far+weight, steps_so_far+1, min_steps, max_steps)
Put in lots of print() calls, they will help you see what is happening.
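To make the outline above concrete, here is one possible way to fill the function in. It is a sketch, not the only way to do it: I've used yield instead of print so that the results can be collected, and the little mapping is a hand-picked subset of your sample data.

```python
def step(start_end, start0, start, weight_so_far, steps_so_far, min_steps, max_steps):
    """Yield (start0, end, total_weight) for each end min_steps..max_steps away."""
    # .get() avoids creating entries for names which have no outgoing edges.
    for end, weight in start_end.get(start, []):
        steps = steps_so_far + 1
        total = weight_so_far + weight
        if steps >= min_steps:
            yield (start0, end, total)
        if steps < max_steps:
            # max_steps also stops us going around cycles forever.
            yield from step(start_end, start0, end, total, steps, min_steps, max_steps)


# A subset of the sample data, chosen by hand for illustration.
start_end = {
    "A1": [("A2", 0.6)],
    "A2": [("A5", 0.5), ("Lignin", 1.0)],
    "A5": [("A3", 0.99), ("Falcarinone", 1.0)],
}

results = list(step(start_end, "A1", "A1", 0.0, 0, 2, 3))
for start0, end, total in results:
    print(start0, end, round(total, 3))
```

This reports every end which is 2 or 3 steps away, including A5 and A3. If you only want some of those, such as the non-"A" names in your expected output, you would filter the results afterwards.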
My expected 2nd CSV file will contain only this kind of information:
start|end|Total weight|
---|---|---|
A1|Lignin|1.6|
A1|Falcarinone|2.1|
You can write this with the “csv” module again; make a “csv.writer” and
give it each output row - that row will be the values at the “print it
out with start0” step above.
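A minimal sketch of the writing side, assuming you have collected the output rows in a list. The two rows here are the ones from your expected output, and io.StringIO stands in for a real file, which you would open with something like open("paths.csv", "w", newline="") where paths.csv is a made-up name.

```python
import csv
import io

# The rows from the expected output in the question.
rows = [("A1", "Lignin", 1.6), ("A1", "Falcarinone", 2.1)]

out = io.StringIO()  # stand-in for an output file opened with newline=""
csvw = csv.writer(out)
csvw.writerow(["start", "end", "Total weight"])  # header row
for start0, end, total in rows:
    # round() keeps floating point noise out of the file
    csvw.writerow([start0, end, round(total, 3)])

print(out.getvalue())
```

The writer takes care of quoting, so a name with a comma in it comes out wrapped in double quotes and can be read back with csv.reader.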
I’d do this in steps:
- load the original CSV data into your mapping and print that out to
make sure it is correct.
- after that is working, write the recursive function to walk the
mapping like a graph. Put in print() calls to see what’s going on.
- after the walking is working, add the csv.writer stuff to write the
data to a file instead of just printing it out; that file can be your
standard output, BTW, so that it lands on your screen:
import sys  # at the top of your script
csvw = csv.writer(sys.stdout)
The pprint module has a handy function for printing complex things:
# at the top of your script
from pprint import pprint
# when you want to print, for example the start_end mapping
pprint(start_end)
It is easier on the eyes than print(start_end) or print(repr(start_end))
which are your more basic choices.
Cheers,
Cameron Simpson cs@cskk.id.au