Re-adapt function to array of strings

I was working with a group on some code to insert, at random, 100 nucleotide sequences (strings) in a FASTA genome (body of text) based on their length.

We designed a small test-case for one chromosome and everything seemed to be going well.
The ultimate goal would be to do so genome-wide for all chromosomes in the FASTA genome; however, here there are some problems:

  • as we implemented the code the various chromosomes end up in a list of strings, what we will need is to randomly assign each one of the selected sequences to any of the chromosome e.g. each one chromosome can contain an x number of strings which can differ among chromosome with the total count genome wide not surpassing one hundred.

I share below the code we used. At first, I thought to use a for loop to implement this feature but either it is not the correct approach, or I’m doing something wrong since I cannot make it work… any help is much appreciated!

P. S. we stored the header of each chromosome separately from the body, as we don’t want anything to be added there, in theory we should be able to parse back the correct headers in the modified chromosomes e.g. post sequence addition. See chr_retros in the code.

EDIT with code of interest only, output and expected results

###library import
import random


###main function
def add_retro_text(genome, all_retros):
    retro_of_choice = [retro for retro in all_retros if 400 < len(retro) < 500]
    hundred_retros = random.sample(retro_of_choice, k=100)
    
    indices = sorted(random.randrange(len(genome)) for _ in hundred_retros)
    
    slices = [genome[i:j] for i, j in zip([0] + indices, indices + [len(genome)])] #I tried to edit here with a loop but couldn't quite get the mechanic probably...

    start = 0
    offsets = [(start := start + len(infix), start := start + len(retro)) for infix, retro in zip(slices, hundred_retros)]

    new_genome = "".join("".join(pairs) for pairs in zip(slices, hundred_retros)) + slices[-1]

    return new_genome, offsets
###nice ordinal output of insertion points
def ordinal(n: int):
    if 11 <= (n % 100) <= 13:
        suffix = 'th'
    else:
        suffix = ['th', 'st', 'nd', 'rd', 'th'][min(n % 10, 4)]
    return str(n) + suffix

# get both the genome and the information about the insertion points
big_genome, positions = add_retro_text(body, s)
#chr_retros = "\n".join([head, big_genome]) #tested on one chromosome works

print(chr_retros + "\n\n" + "This is where retros have been added:" + "\n")

for r in range(len(positions)):
    print(f" The {str(ordinal(r+1))} retrocopy has been added to the following position in the genome {str(positions[r])}")

OUTPUT for the first chromosome —truncated after the end of the first string inserted (the string start after the N characters)

NC_000001.10
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNATGGCACCGAAAGCGAAGAAGGAAGCTCCTGCCCCTCCTAAAGCTGAAGCCAAAGCGAAGGCTTTAAAGGCCAAGAAGGCAGTGTTGAACGGTGTCCACAGCCACAAAAAAAAGATCCACACGTCACCCACCTTCCGGCGGCCCAAGACACTGCGACTCCGGAGACAGCCCGAATATCCTCGGAAGAGCACTCCCAGGAAAAACAAGCTTGACCACTATGCTATCATCAAGTTTCCGCTGACCACTGAGTCTGCCATGAAGAAGATAGAAGACAACAACACACTTGTGTTCATTGTGGTTGTTAAAGCCAACAAGCACCAGATTAAACAGGCTGTGAAGTAGCTCTATGACATTGATGTGGCCAAGGTCAACACCCTGATTCGGCCTGATGGAGAGAAGAAGGCATATGTTCGACTGGCTCCTGATTACGATGCTTTGGATGTTGTCAACAAAATTGGGATCATC...

This is where retros have been added:

The 1st retrocopy has been added to the following position in the genome (636, 1101)

EXPECTED OUTPUT — for each one of the chromosomes in the genome I expect to have a random number of strings of selected length (between 400 and 500 characters) to be inserted within their sequence without exceeding the 100 total strings added to the FASTA genome

NC_000001.10
...some sequnce...STRING...some sequnce...STRING
NC_000002.11
...some sequnce...STRING...some sequnce...STRING...some sequnce...STRING...some sequnce... STRING
NC_000003.11
...some sequnce...STRING
.
.
.

The number of strings added in each chromosome doesn’t have to be the same as long as they are 100 in total among all chromosomes in the genome; these are stored as a list of strings, but the code above works only for a single instance…

Hello,

There are a few moving parts here.

  1. Generally, when opening a file, you have to specify its purpose. Are you opening it for
    reading or writing? This has to be specified. A few options to consider:

    r’ - reading
    w’ - writing

  2. It is recommended to use the with open context manager when opening files since it
    will close the file operation automatically once completed even in cases of
    exceptions. If you use the open method, once done, you have to explicitly close it. Note
    that in your script you did not close it and thus the object is left floating.

    So, if you are reading the file, you may try something like this (I am assuming that you want to pass the contents of the file to the SeqIO method):

    with open('homo_sapians_sequences.txt', 'r', encode='UTF-8') as input_file:
    
        file_contents = input_file.read()
        my_dict = SeqIO.to_dict(SeqIO.parse(file_contents, "fasta"))       
    

As you’re building your script, continously verify that it is progressing as you expect it to. Do not wait until the end to test it. For example, you can immediately print the variable my_dict to verify if its contents are what you were expecting. If they are not, there is no point in continuing with writing the rest of the script. First resolve the issue in front of you. Otherwise, the issue(s) will just propagate along and will then make it more cumbersome to debug. If the contents ARE what you were expecting, then move along with the first for loop in the script. If the contents of the s list are not what you were expecting, then you know with certaintly that it is the for loop that is having an issue.

That is what I would recommend. Build your script with continous testing / verification as you’re creating it.

I am not familiar with the Bio package module so I cannot offer much help there.

2 Likes

@onePythonUser really appreciated the heads-up about the with open context manager since I’m green hand with Python.

About the ongoing/continuous testing is something I’m already doing, I omitted the lines of code here because I didn’t think they might have been useful for someone else to address the issue; however, for the future I will keep in mind to leave them in.

No worries about the Bio package I also hoped there were some other ways to easily (as for a novice) to handle biological data in Python

P.S. I added in the main issue the controls/check-points where I was trying to understand if the code was doing what intended to

If you can paste the exact part of the script that you would like help with, that would be helpful to folks who are reviewing your script. Sometimes its the setting the data up part that is most challenging to new users versus using the end package module (i.e., Bio) since after that, its usually a plug and play.

If you provide it in this form, even more helpful similar to your previous post:

When I provide this, I should get that, but instead I get this.

Yes, that makes sense. I might just edit the main part to show only the incriminated function, what it does on a single sequence, and what I’m expected it to do in case of a full genome.