I was working with a group on some code to insert, at random, 100 nucleotide sequences (strings) in a FASTA genome (body of text) based on their length.
We designed a small test-case for one chromosome and everything seemed to be going well.
The ultimate goal would be to do so genome-wide for all chromosomes in the FASTA genome; however, here there are some problems:
- as we implemented the code the various chromosomes end up in a list of strings, what we will need is to randomly assign each one of the selected sequences to any of the chromosome e.g. each one chromosome can contain an
x
number of strings which can differ among chromosome with the total count genome wide not surpassingone hundred
.
I share below the code we used. At first, I thought to use a for
loop to implement this feature but either it is not the correct approach, or I’m doing something wrong since I cannot make it work… any help is much appreciated!
P. S. we stored the header of each chromosome separately from the body, as we don’t want anything to be added there, in theory we should be able to parse back the correct headers in the modified chromosomes e.g. post sequence addition. See chr_retros
in the code.
EDIT with code of interest only, output and expected results
###library import
import random
###main function
def add_retro_text(genome, all_retros):
retro_of_choice = [retro for retro in all_retros if 400 < len(retro) < 500]
hundred_retros = random.sample(retro_of_choice, k=100)
indices = sorted(random.randrange(len(genome)) for _ in hundred_retros)
slices = [genome[i:j] for i, j in zip([0] + indices, indices + [len(genome)])] #I tried to edit here with a loop but couldn't quite get the mechanic probably...
start = 0
offsets = [(start := start + len(infix), start := start + len(retro)) for infix, retro in zip(slices, hundred_retros)]
new_genome = "".join("".join(pairs) for pairs in zip(slices, hundred_retros)) + slices[-1]
return new_genome, offsets
###nice ordinal output of insertion points
def ordinal(n: int):
if 11 <= (n % 100) <= 13:
suffix = 'th'
else:
suffix = ['th', 'st', 'nd', 'rd', 'th'][min(n % 10, 4)]
return str(n) + suffix
# get both the genome and the information about the insertion points
big_genome, positions = add_retro_text(body, s)
#chr_retros = "\n".join([head, big_genome]) #tested on one chromosome works
print(chr_retros + "\n\n" + "This is where retros have been added:" + "\n")
for r in range(len(positions)):
print(f" The {str(ordinal(r+1))} retrocopy has been added to the following position in the genome {str(positions[r])}")
OUTPUT for the first chromosome —truncated after the end of the first string inserted (the string start after the N characters)
NC_000001.10
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNATGGCACCGAAAGCGAAGAAGGAAGCTCCTGCCCCTCCTAAAGCTGAAGCCAAAGCGAAGGCTTTAAAGGCCAAGAAGGCAGTGTTGAACGGTGTCCACAGCCACAAAAAAAAGATCCACACGTCACCCACCTTCCGGCGGCCCAAGACACTGCGACTCCGGAGACAGCCCGAATATCCTCGGAAGAGCACTCCCAGGAAAAACAAGCTTGACCACTATGCTATCATCAAGTTTCCGCTGACCACTGAGTCTGCCATGAAGAAGATAGAAGACAACAACACACTTGTGTTCATTGTGGTTGTTAAAGCCAACAAGCACCAGATTAAACAGGCTGTGAAGTAGCTCTATGACATTGATGTGGCCAAGGTCAACACCCTGATTCGGCCTGATGGAGAGAAGAAGGCATATGTTCGACTGGCTCCTGATTACGATGCTTTGGATGTTGTCAACAAAATTGGGATCATC...
This is where retros have been added:
The 1st retrocopy has been added to the following position in the genome (636, 1101)
EXPECTED OUTPUT — for each one of the chromosomes in the genome I expect to have a random number of strings of selected length (between 400 and 500 characters) to be inserted within their sequence without exceeding the 100 total strings added to the FASTA genome
NC_000001.10
...some sequnce...STRING...some sequnce...STRING
NC_000002.11
...some sequnce...STRING...some sequnce...STRING...some sequnce...STRING...some sequnce... STRING
NC_000003.11
...some sequnce...STRING
.
.
.
The number of strings added in each chromosome doesn’t have to be the same as long as they are 100 in total among all chromosomes in the genome; these are stored as a list of strings, but the code above works only for a single instance…