Time consuming to handle an input file of size 2.7GB?

Hi. I have a 2.7GB input .txt file and a simple piece of code to process it, shown below. My goal is to create two files from this input, named inputSeqs.fasta and inputMeta.tsv. The problem is that it takes a rather long time to run.

with open("./ena_embl-covid19_20210107-0806.txt") as f:
    line = f.readline()
    while line != "":
        if line[:2] == "ID":
            strainNameIDLst.append(line[2:].strip()[:8])
        elif line[:2] == "DT":
            samplingDateLst.append(line[2:].strip()[:11])
        elif line[:2] == "SQ":
            lineLst = line.split(" ")
            tmpLst = []
            for item in lineLst:
                if item.isdigit():
                    tmpLst.append(int(item))
            seqLengthLst.append(tmpLst[0])
            aCountsLst.append(tmpLst[1])
            cCountsLst.append(tmpLst[2])
            gCountsLst.append(tmpLst[3])
            tCountsLst.append(tmpLst[4])
            crrtSeq = ""
            line = f.readline().strip()
            while line != "//": # need to test this point here
                crrtSeq += line
                line = f.readline().strip()
            seqsLst.append(crrtSeq)
        elif line[:2] == "DE":
            line = f.readline()
            countryLst.append(line[2:].strip().split("/")[2])
            hostLst.append(line[2:].strip().split("/")[1])
        line = f.readline()  # advance to the next line (not shown in the post as pasted; the loop needs it to terminate)

I was wondering, is there a better way to handle it? Many thanks.

After running the above code, the process finally returns, as the following message indicates.

Process returned -9 (0x-9) execution time : 1747.012 s
Close this window to continue…
(base) Lucys-MacBook-Pro:~ lucy$

So, what is the meaning of it?

It is normal for the processing of such a large file to take time. Without complete code, it’s hard to tell how to do better (and the indentation is mangled, not sure why).

You might want to try running your code with PyPy or numba.


Hi bsn,

(By the way, do you have a name you would rather be called instead
of “bsn”?)

You say “it takes a rather long time to run”, but we don’t know what you
consider a long time to process a 2.7GB file. A minute? Five minutes? An
hour? A day?

I expect that processing time here is dominated by the speed of your
file I/O. If you are reading a file on a network drive, the speed you
can read it is limited by the slower of the network speed and the
storage speed at the back end.

Worst case: you are reading from a really slow DVD-ROM drive on a
network drive over slow and unreliable wireless. Best case, a really
fast SSD on the local machine.

If you have the option, move the file to the fastest local drive on the
fastest computer you have, and run the code on that computer.

As far as your code goes, it mostly looks reasonable to me, except for
one part that could be excessively slow. This part:

crrtSeq = ""
line = f.readline().strip()
while line !="//":       # need to test this point here
    crrtSeq += line
    line = f.readline().strip()
seqsLst.append(crrtSeq)

Repeated string concatenation like that can sometimes be very slow. It’s
hard to tell whether this is a problem for you, but just in case, I
would re-write it to this:

crrtSeq = []
line = f.readline().strip()
while line !="//":       # need to test this point here
    crrtSeq.append(line)
    line = f.readline().strip()
seqsLst.append(''.join(crrtSeq))

This may or may not make any difference, but it shouldn’t be slower and
it might be faster.
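
The list-and-join pattern can also be tried on its own, away from the big file. The following is a minimal sketch, not the original poster's code: `read_sequence` is a made-up helper name, and the in-memory `sample` stands in for the real input. It adds one thing the loop above does not have, an end-of-file check, because `readline()` returns an empty string at end of file and a loop that only tests for "//" would never stop if the terminator were missing.

```python
import io

def read_sequence(f):
    """Read lines from f up to the "//" terminator and return them joined."""
    parts = []
    line = f.readline()
    # Stop at the terminator, or at end of file (readline() returns "").
    while line and line.strip() != "//":
        parts.append(line.strip())
        line = f.readline()
    # One join at the end instead of one concatenation per line.
    return "".join(parts)

# Tiny in-memory stand-in for the real file (made-up data).
sample = io.StringIO("acgt acgt\nttga\n//\nID next record\n")
seq = read_sequence(sample)
print(seq)
```

After the call, the file position sits just past the "//" line, so the outer loop can carry on with the next record.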


Thank you for your kind reply. You may call me Lucy, by the way.

Hi Lucy,

Did you try the changes I suggested, and did they make a difference?

Regards,

Steve


Hi, I tried what you suggested and it worked well. My file is 2.8GB, and it ran for about 28.95 minutes. Not very amazing currently. ;)

Can you run this simple script over your file? This will tell you
roughly how much of that 30 minutes is being taken from reading the
file.

import time
filename = "<path to your file goes here>"
t = time.time()
with open(filename, 'r') as f:
    for count, line in enumerate(f, 1):
        pass
t = time.time() - t
print("Lines:", count)
print("Minutes:", t/60)

Short of moving the file to a faster disk, that’s pretty much the
minimum time it takes to read the file. There’s not really anything you
can do to speed that up except to get a faster disk or a smaller file.

Depending on the difference between that time and the 30 minutes you are
getting, it might be worth spending some more effort on improving your
code.
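
One way to make that comparison concrete is to time the bare read and a processing pass with the same helper. This is only a rough sketch: `timed`, `count_lines` and `count_sq_lines` are made-up names, the per-line work is a stand-in for the real parsing, and the in-memory list stands in for the real file.

```python
import time

def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

def count_lines(lines):
    """Baseline: just iterate, doing no work per line."""
    count = 0
    for _ in lines:
        count += 1
    return count

def count_sq_lines(lines):
    """Stand-in for real processing: one prefix test per line."""
    return sum(1 for line in lines if line[:2] == "SQ")

# Small in-memory data so the sketch runs anywhere; with a real file,
# open it once per measurement and pass the file object instead.
data = ["ID x\n", "SQ 4 BP\n", "acgt\n", "//\n"] * 1000
n, t_read = timed(count_lines, data)
n_sq, t_parse = timed(count_sq_lines, data)
print("lines:", n, "read seconds:", t_read)
print("SQ lines:", n_sq, "parse seconds:", t_parse)
```

If the processing time is many times the baseline, the per-line work is the place to look; if the two are close, the disk is the limit.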


Lines: 39879168
Minutes: 0.11047803163528443

[Finished in 6.677s]

Well, that was certainly faster than I expected. Six seconds to read a 3
GB file. Are you using an SSD?

I’ll have another look at your file processing code when I get a chance.
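
Since the stated goal is two output files, one option is to write each record out as soon as it is parsed instead of collecting everything in lists first, so that only one record is held in memory at a time. The following is a minimal sketch under several assumptions, not a drop-in replacement: `iter_records` and `convert` are made-up names, only the ID, DT and SQ handling from the original loop is mirrored (country and host are left out), only the first DT line of each record is kept, sequence lines are kept verbatim as in the original, and the sample record is invented.

```python
import io

def iter_records(lines):
    """Yield (record_id, date, sequence) for each "//"-terminated record."""
    record_id, date, parts, in_seq = "", "", [], False
    for line in lines:
        tag = line[:2]
        if in_seq:
            if line.strip() == "//":
                yield record_id, date, "".join(parts)
                # Reset state for the next record.
                record_id, date, parts, in_seq = "", "", [], False
            else:
                parts.append(line.strip())
        elif tag == "ID":
            record_id = line[2:].strip()[:8]
        elif tag == "DT" and not date:
            # Assumption: keep only the first DT line of a record.
            date = line[2:].strip()[:11]
        elif tag == "SQ":
            in_seq = True

def convert(lines, fasta_out, meta_out):
    """Write one FASTA entry and one TSV row per record; return the count."""
    count = 0
    for record_id, date, seq in iter_records(lines):
        fasta_out.write(f">{record_id}\n{seq}\n")
        meta_out.write(f"{record_id}\t{date}\n")
        count += 1
    return count

# Made-up sample record, only to show the flow end to end.
sample = io.StringIO(
    "ID   AB123456; SV 1\n"
    "DT   01-JAN-2021 (Rel. 1)\n"
    "SQ   Sequence 8 BP;\n"
    "acgt\n"
    "ttga\n"
    "//\n"
)
fasta, meta = io.StringIO(), io.StringIO()
n = convert(sample, fasta, meta)
print(fasta.getvalue(), meta.getvalue(), sep="")
```

With real files, the three arguments would be the opened input file and two files opened for writing (for example inputSeqs.fasta and inputMeta.tsv) in a single `with` statement.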