Format and output FASTA file

Matteo · April 18, 2025, 2:29pm

Hi there, I’m working on a bioinformatics project where I need to import, edit and export a FASTA sequence.

Now, I managed to import and edit the sequence; however, once I export it I need the header to be placed before the body of each chromosome sequence followed by an enter/new-line. This has to repeat for all chromosomes.

Below, an example of the intended final output:

>chr_header1
GGACTTGAAACAGGGAGTCCTGCACGACCCGTATTGCACGTTAAAGGCAGGCCACACTGTTCCCGATATCAAAGCCCAAACGTGTGAGTATTGGAATTCACGGCGGAAGGTTCACCATTCGTCTATAGAAATTTTCATCAACCCGGAACT...
>chr_header2
AACAATGCTGTACCAGACCCCCTAACCTCCTCAGGTGATAAGTGTCTGACTGCTGACTTGTCCTAAATTGTCCGCTAGAATGGAATCCTAACGTTCGAAATACTTATTCGGTATACCAGGGTCGGCATTTTATTTTCGTCGATTATTTCG...
.
.
.

This is the code I’m using

 ###library import
from Bio import SeqIO
import random

###external strings handling, to be added into the FASTA
input_file = open("hs_retro.txt")
my_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
#my_dict

###compute sequence length
l = []
for v in my_dict.values():
   l.append(len(v))
print(l)

###extract string sequences
s = []
for k in my_dict.values():
   s.append(k)
# if you have items with a seq attribute, then extract that first:
s = [str(item.seq) for item in s]
#print([s])

###import FASTA
def fasta_reader(filename):
  from Bio.SeqIO.FastaIO import FastaIterator
  with open(filename) as handle:
    for record in FastaIterator(handle):
      yield record

head = []
body = []
#head = ""
#body = ""
for entry in fasta_reader("hg37.fasta"):
    head.append(str(entry.id))
    body.append(str(entry.seq))
    #head = str(entry.id)
    #body = str(entry.seq)

#print(head)
#print(body)

header = [">" + h for h in head]
#THIS PART NEED TO BE CHANGED...
grch37 = "\n".join(map(str, [a + b for a,b in zip(header,body)]))
with open("/path/to/grch37.fasta", "w") as f:
    f.write(grch37)

#chr1 = "\n".join(map(str, [header[0], body[0]]))
#with open("/Users/matte/Downloads/chr1.fasta", "w") as f:
#    f.write(chr1)

The problem with this approach is that when I output the file to a desired location the header is stitched to the sequence of nucleotides in the various chromosomes, as such:

>chr_header1GGACTTGAAACAGGGAGTCCTGCACGACCCGTATTGCACGTTAAAGGCAGGCCACACTGTTCCCGATATCAAAGCCCAAACGTGTGAGTATTGGAATTCACGGCGGAAGGTTCACCATTCGTCTATAGAAATTTTCATCAACCCGGAACT...

Also, I’m not entirely sure the code is inserting a \n between the end of each chromosome and the header of the following one, as specified in the join method… any help is much appreciated, thanks in advance!

onePythonUser · April 18, 2025, 2:40pm

Hello,

Why not at the beginning of each sequence?

Matteo · April 18, 2025, 3:32pm

@onePythonUser the file is quite big, and I’m not entirely sure the code is doing what I intend to do… If I grep the > symbol this prints out the entire file instead of showing only the headers. So, after inspecting it I realized the script was not splitting the header from the body with a new-line.

I don’t have an easy way to assess whether is splitting the end of a chromosome and the beginning of the next by a \n…

onePythonUser · April 18, 2025, 3:39pm

What if you try something like this:

head = ['hello', 'goodbye', 'today', 'tomorrow', 'yesterday']

header = ['>' + h + '\n' for h in head]

for i in header:
    print(i)

Note how you add the \n when creating the header list, and you get this:

>hello

>goodbye

>today

>tomorrow

>yesterday

Without it (as it is currently written):

heading = ['>' + h for h in head]

for i in heading:
    print(i)

The output is:

>hello
>goodbye
>today
>tomorrow
>yesterday

jamestwebber · April 18, 2025, 3:59pm

Matteo Tommaso:

header = [">" + h for h in head]
#THIS PART NEED TO BE CHANGED...
grch37 = "\n".join(map(str, [a + b for a,b in zip(header,body)]))
with open("/path/to/grch37.fasta", "w") as f:
    f.write(grch37)

Breaking down this line:

grch37 = "\n".join(map(str, [a + b for a,b in zip(header,body)]))

"\n".join(...) is going to join the things you give it with newline characters. That’s great. But what you’re giving it is the output of a+b for a,b in zip(header, body). a+b is concatenating those two strings together, without a newline in between them. That’s why you have the two stiched together. You could use string formatting here: f"{a}\n{b}" for a,b ...

But a larger problem is that you’re fully creating this fasta file in memory before you write it to disk. Actually, you’re doing it twice! First you read it into two lists, and then create a massive string to write to the file. You don’t want to do any of that–a genome file is huge, it’ll use gigabytes of memory to hold it.

Your fasta_reader function is a generator–it gives you one entry at a time. That’s nice because it only needs to hold one entry in memory at a time, which is much smaller than the whole genome^[1].

You can condense a lot of this into one pass through the fasta file, and store nothing in memory:


# open the output file for writing first
with open("path/to/grch37.fasta", "w") as f:
    for entry in fasta_reader("hg37.fasta"):
        # use print, it adds a newline itself 
        print(f">{entry.id}", file=f). # using f-string formatting
        print(entry.seq, file=f)

Okay, chr1 is pretty big on its own, but lets ignore that for now. edit: it’s less than 10%, at least ↩︎

MRAB · April 18, 2025, 4:00pm

If a is ">chr_header1" and b is "GGACTTGAAAC...", then a + b is ">chr_header1GGACTTGAAAC...", which is what you’re getting.

You need to put a "\n" between them, i.e. a + "\n" + b.

As you’re joining the results with "\n".join(...), the final one won’t end in a line break.

You’ll get, say, ">chr_header1\nGGACTTGAAAC...\n>chr_header2\nAACAATGCTGT...". Note the lack of a "\n" at the end.

Does that matter?

Instead of building a long string containing everything so that you can write it all out in one go, you could do this:

with open("/path/to/grch37.fasta", "w") as f:
    f.writelines(f">{h}\n{b}\n" for h, b in zip(head, body))

jamestwebber · April 18, 2025, 4:00pm

You can open the file up in a basic text editor (like nano, or emacs) and go to the end of each line–it won’t do line wrapping, so if there is a newline, you’ll see it.

Matteo · April 18, 2025, 4:06pm

@jamestwebber thanks I really appreciate the feedback, I’ll edit as suggested. What I did was an attempt to handle the body and the header for each chromosome independently since I will need to randomly add string sequences in chromosomes bodies and then patch them back together, for which I already have a script ready which might need some readapting

Matteo · April 18, 2025, 4:06pm

@MRAB really appreciated another alternative approach!

jamestwebber · April 18, 2025, 4:11pm

It’ll probably be easier to do things in memory while you’re developing, but the generator pattern (including the version @MRAB posted) is a nice thing to keep in mind when you’ve got things working. It’ll be a lot nicer to run.

To go back to the newline comment–if you aren’t using one already, definitely look into using a nice IDE (VSCode and PyCharm are both free to use) or at least a decent technical text editor (there are lots of options for this: BBEdit, CotEditor, others) that well let you view text files including special characters. This makes it a lot easier to debug output when you’re not sure about what characters are being written.

Matteo · April 18, 2025, 4:14pm

@jamestwebber Indeed, I’m familiar with VS code which is lightweight and handy, but had to swap to PyCharm for performance issues e.g. VS Code sometimes crashes on big input, as you hinted.