Hi there, I’m working on a bioinformatics project where I need to import, edit and export a FASTA sequence.
I have managed to import and edit the sequence; however, when I export it, I need each chromosome's header on its own line, followed by a newline and then the body of that chromosome's sequence. This has to repeat for all chromosomes.
Below is an example of the intended final output:
>chr_header1
GGACTTGAAACAGGGAGTCCTGCACGACCCGTATTGCACGTTAAAGGCAGGCCACACTGTTCCCGATATCAAAGCCCAAACGTGTGAGTATTGGAATTCACGGCGGAAGGTTCACCATTCGTCTATAGAAATTTTCATCAACCCGGAACT...
>chr_header2
AACAATGCTGTACCAGACCCCCTAACCTCCTCAGGTGATAAGTGTCTGACTGCTGACTTGTCCTAAATTGTCCGCTAGAATGGAATCCTAACGTTCGAAATACTTATTCGGTATACCAGGGTCGGCATTTTATTTTCGTCGATTATTTCG...
.
.
.
This is the code I’m using:
###library import
from Bio import SeqIO
import random
###external strings handling, to be added into the FASTA
input_file = open("hs_retro.txt")
my_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
#my_dict
###compute sequence length
l = []
for v in my_dict.values():
    l.append(len(v))
print(l)
###extract string sequences
s = []
for k in my_dict.values():
    s.append(k)
# if you have items with a seq attribute, then extract that first:
s = [str(item.seq) for item in s]
#print([s])
###import FASTA
def fasta_reader(filename):
    from Bio.SeqIO.FastaIO import FastaIterator
    with open(filename) as handle:
        for record in FastaIterator(handle):
            yield record
head = []
body = []
#head = ""
#body = ""
for entry in fasta_reader("hg37.fasta"):
    head.append(str(entry.id))
    body.append(str(entry.seq))
#head = str(entry.id)
#body = str(entry.seq)
#print(head)
#print(body)
header = [">" + h for h in head]
#THIS PART NEEDS TO BE CHANGED...
grch37 = "\n".join(map(str, [a + b for a,b in zip(header,body)]))
with open("/path/to/grch37.fasta", "w") as f:
    f.write(grch37)
#chr1 = "\n".join(map(str, [header[0], body[0]]))
#with open("/Users/matte/Downloads/chr1.fasta", "w") as f:
# f.write(chr1)
The problem with this approach is that, when I write the file to the desired location, each header is stitched directly onto the nucleotide sequence of its chromosome, like this:
>chr_header1GGACTTGAAACAGGGAGTCCTGCACGACCCGTATTGCACGTTAAAGGCAGGCCACACTGTTCCCGATATCAAAGCCCAAACGTGTGAGTATTGGAATTCACGGCGGAAGGTTCACCATTCGTCTATAGAAATTTTCATCAACCCGGAACT...
Also, I’m not entirely sure the code is inserting a \n between the end of each chromosome and the header of the following one, as specified in the join method. Any help is much appreciated, thanks in advance!
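
One possible direction, as a minimal sketch rather than a definitive fix: the expression a + b concatenates each header and sequence with nothing in between, and "\n".join(...) only places a newline between list items, so the header/sequence newline has to be added inside each item. The helper name format_fasta and the short sequences below are made up for illustration; in the real script the header and body lists built above would be passed in instead.

```python
def format_fasta(headers, bodies):
    """Return FASTA text with each header on its own line above its sequence."""
    # Put the newline between header and sequence inside each entry;
    # "\n".join() then only has to separate one entry from the next.
    entries = [f"{h}\n{b}" for h, b in zip(headers, bodies)]
    # A trailing newline ends the last sequence line cleanly.
    return "\n".join(entries) + "\n"


# Hypothetical stand-ins for the `header` and `body` lists from the script.
header = [">chr_header1", ">chr_header2"]
body = ["GGACTT", "AACAAT"]

grch37 = format_fasta(header, body)
print(grch37, end="")
```

The resulting string can be written with f.write(grch37) exactly as before. Alternatively, Biopython's SeqIO.write(records, handle, "fasta") writes the header lines itself, though it wraps long sequences across multiple lines, which may or may not be wanted here.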