How to separate data using path concept from a CSV file to another CSV file

Hello Good People,

I have a CSV file that contains information about a path including the weight of the path. The dataset sample is given below:

start end weight
A1 A2 0.6
A2 A5 0.5
A3 A1 0.75
A4 A5 0.88
A5 A3 0.99
(+)-1(10),4-Cadinadiene Falcarinone 0.09
Leucodelphinidin (+)-1(10),4-Cadinadiene 0.876
Lignin (2E,7R,11R)-2-Phyten-1-ol 0.778
(2E,7R,11R)-2-Phyten-1-ol Leucodelphinidin 0.55
Falcarinone Lignin 1
A1 (+)-1(10),4-Cadinadiene 1
A2 Lignin 1
A3 (2E,7R,11R)-2-Phyten-1-ol 1
A4 Leucodelphinidin 1
A5 Falcarinone 1

Now I want to create another CSV file based on the path concept. For example, from A1 I can visit A2, and from A2 I can visit Lignin (A1 -> A2 -> Lignin). So we can say that Lignin is reachable from A1. In the same way, Falcarinone is reachable from A1 (A1 -> A2 -> A5 -> Falcarinone).

However, I am not interested in paths such as (A1 -> A2 -> A5 -> A3 -> (2E,7R,11R)-2-Phyten-1-ol). That means I want a path to visit at least 2 and at most 3 A nodes.

My expected 2nd CSV file will contain only this kind of information:

start end Total weight
A1 Lignin 1.6
A1 Falcarinone 2.1

I have no idea how to do this task, and I am sorry that I do not have any reproducible code.

I will be grateful if you give any suggestions or ideas.


I have a CSV file that contains information about a path including the
weight of the path. The dataset sample is given below:

start|end|weight|
---|---|---|
A1|A2|0.6|
A2|A5|0.5|
A3|A1|0.75|
A4|A5|0.88|
A5|A3|0.99|
(+)-1(10),4-Cadinadiene|Falcarinone|0.09|
Leucodelphinidin|(+)-1(10),4-Cadinadiene|0.876|
Lignin|(2E,7R,11R)-2-Phyten-1-ol|0.778|
(2E,7R,11R)-2-Phyten-1-ol|Leucodelphinidin|0.55|
Falcarinone|Lignin|1|
A1|(+)-1(10),4-Cadinadiene|1|
A2|Lignin|1|
A3|(2E,7R,11R)-2-Phyten-1-ol|1|
A4|Leucodelphinidin|1|
A5|Falcarinone|1|

Now I want to create another CSV file based on the path concept. For example, from A1 I can visit A2, and from A2 I can visit Lignin (A1 -> A2 -> Lignin). So we can say that Lignin is reachable from A1. In the same way, Falcarinone is reachable from A1 (A1 -> A2 -> A5 -> Falcarinone).

You’ll need a few Python modules to help you. The Python docs are here:

https://docs.python.org/3/

and there’s a section for each module mentioned below.

To process this in Python you will need a mapping from “start” to the
various "end"s reachable, and (from lower in your post) the weight.

So you will want a 2-tuple for each endpoint like this, expressing its
name and weight:

("A2", 0.6)

and since you can have multiple "end"s for each “start” you will want a
list of them. For example:

[("A2",0.6), ("(+)-1(10),4-Cadinadiene",1)]

Then assemble a mapping from “start” to a list for the various ends:

{ "A1": [("A2",0.6), ("(+)-1(10),4-Cadinadiene",1)],
.......
}

filled out for the various "start"s and "end"s. This mapping lets you
enumerate the places you can go from an arbitrary “start”.

I would recommend a defaultdict for that mapping, which is a special
kind of dict which autocreates missing elements:

# at the top of your script
from collections import defaultdict

# set up the mapping (initially empty)
start_end = defaultdict(list)

That way you’re guaranteed that each element is a list, so that you can
append to it.
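
As a small illustration of that behaviour (a sketch only; the values are taken from the sample data and "A9" is a made-up name that appears nowhere in it):

```python
from collections import defaultdict

# Mapping from each "start" to a list of (end, weight) pairs.
start_end = defaultdict(list)

# No need to check whether "A1" is already a key: the first access
# creates an empty list, and append() then adds to it.
start_end["A1"].append(("A2", 0.6))
start_end["A1"].append(("(+)-1(10),4-Cadinadiene", 1.0))
start_end["A2"].append(("A5", 0.5))

print(start_end["A1"])
print(start_end["A9"])   # a name never seen before yields an empty list
```

Note that merely looking up a missing name, as in the last line, also stores an empty list under that key.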

Since you have a CSV file as input, use the “csv” module to read your
file and fill out the mapping:

# at the top of your script
import csv

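# scan the data file (csv.reader needs an open file, not a filename)
with open("your-datafilename-here.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row, assuming the file has one
    for start, end, weight in reader:
        start_end[start].append((end, float(weight)))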

Print out start, end, weight as you read them to be sure of what’s going
on.
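
Here is a self-contained sketch of the loading step that you can run without a data file. It assumes the file is comma-separated with a header row, and that names containing commas are quoted; `load_edges` is a name I made up for illustration, and `io.StringIO` stands in for the real file:

```python
import csv
import io
from collections import defaultdict

def load_edges(f):
    """Read start,end,weight rows from an open file into a mapping."""
    start_end = defaultdict(list)
    reader = csv.reader(f)
    next(reader)                      # skip the header row (assumed present)
    for start, end, weight in reader:
        start_end[start].append((end, float(weight)))
    return start_end

# Stand-in for a real file; names containing commas must be quoted.
sample = io.StringIO(
    'start,end,weight\n'
    'A1,A2,0.6\n'
    'A2,A5,0.5\n'
    'A1,"(+)-1(10),4-Cadinadiene",1\n'
    'A2,Lignin,1\n'
)
start_end = load_edges(sample)
print(dict(start_end))
```

If your real file is separated by tabs or spaces instead, `csv.reader` accepts a `delimiter` argument.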

However, I am not interested in paths such as (A1 -> A2 -> A5 -> A3 -> (2E,7R,11R)-2-Phyten-1-ol). That means I want a path to visit at
least 2 and at most 3 A nodes.

Here is where things get tricky. You want a recursive function to walk
your mapping from a starting name eg “A1” and return every reachable
“end” at least 2 steps away and no more than 3 steps away, and the sum
of the weights in each step taken.

I would write a function accepting these things:

  • the mapping
  • the original starting name
  • the “current” start name, representing where you are right now in
    the graph
  • total weight so far
  • total steps so far
  • minimum steps
  • maximum steps

It would be a recursive function, which calls itself to continue from
each “start”:

def step(start_end, start0, start, weight_so_far, steps_so_far, min_steps, max_steps):
    ....

Think about what the function must do:

  • find all the ends from “start”, which is just start_end[start]
  • loop over those (end,weight) pairs
  • for each end, if you’ve reached min_steps, print it out along with
    start0
  • if you haven’t reached max_steps, call the function again with updated
    values:
    step(start_end, start0, end, weight_so_far+weight, steps_so_far+1, min_steps, max_steps)

Put in lots of print() calls, they will help you see what is happening.
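
One possible shape for that function, as a rough sketch rather than a finished solution. It follows the steps listed above, with two assumptions of mine: it collects rows in an extra `results` list instead of printing them, and it counts the step just taken before comparing against `min_steps` and `max_steps`. The mapping is a small hand-written subset of your data:

```python
def step(start_end, start0, start, weight_so_far, steps_so_far,
         min_steps, max_steps, results):
    """Collect (start0, end, total_weight) for every end reachable
    from `start` within the allowed number of steps."""
    # .get() returns an empty list for names that have no outgoing edges
    for end, weight in start_end.get(start, []):
        total = weight_so_far + weight
        steps = steps_so_far + 1
        if steps >= min_steps:
            results.append((start0, end, total))
        if steps < max_steps:
            step(start_end, start0, end, total, steps,
                 min_steps, max_steps, results)

# A small hand-written subset of the edges from the question.
start_end = {
    "A1": [("A2", 0.6)],
    "A2": [("A5", 0.5), ("Lignin", 1.0)],
    "A5": [("Falcarinone", 1.0)],
}

results = []
step(start_end, "A1", "A1", 0.0, 0, 2, 3, results)
for row in results:
    print(row)
```

Because this counts steps only, it also reports intermediate stops such as A5, which is two steps from A1. If you only want the non-A names as ends you would need an extra check before recording a row. The `max_steps` limit is also what stops the recursion from looping forever around cycles such as A1 -> A2 -> A5 -> A3 -> A1.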

My expected 2nd CSV file will contain only this kind of information:

start|end|Total weight|
---|---|---|
A1|Lignin|1.6|
A1|Falcarinone|2.1|

You can write this with the “csv” module again; make a “csv.writer” and
give it each output row - that row will be the values at the “print it
out with start0” step above.
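
A minimal sketch of the writing side, using the two rows from your expected output as hard-coded data. `io.StringIO` is only a stand-in here so the example runs without touching the disk; for a real file you would use something like `open("output.csv", "w", newline="")`, where the filename is just a placeholder:

```python
import csv
import io

# Rows as they would come out of the graph walk: (start, end, total weight).
rows = [("A1", "Lignin", 1.6), ("A1", "Falcarinone", 2.1)]

out = io.StringIO()          # stand-in for a real output file
csvw = csv.writer(out)
csvw.writerow(["start", "end", "Total weight"])   # header row
csvw.writerows(rows)                              # one line per result

print(out.getvalue())
```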

I’d do this in steps:

  • load the original CSV data into your mapping and print that out to
    make sure it is correct.

  • after that is working, write the recursive function to walk the
    mapping like a graph. Put in print() calls to see what’s going on.

  • after the walking is working, add the csv.writer stuff to write the
    data to a file instead of just printing it out; that file can be your
    standard output, BTW, so that it lands on your screen:

    import sys  # at the top of your script
    csvw = csv.writer(sys.stdout)

The pprint module has a handy function for printing complex things:

# at the top of your script
from pprint import pprint

# when you want to print, for example the start_end mapping
pprint(start_end)

It is easier on the eyes than print(start_end) or print(repr(start_end))
which are your more basic choices.

Cheers,
Cameron Simpson cs@cskk.id.au


@cameron thank you very much for your perfect guidelines.

I was also trying something similar. However, I will follow your guidelines.


The question is complicated enough that if you’re genuinely new to
Python it is pretty hard (because you need several things, all of which
may be new to you).

If you get stuck, come back with your code and of course the failing
output, etc etc.

Cheers,
Cameron Simpson cs@cskk.id.au