Time consuming to handle an input file of size 2.7GB?

bsn · January 8, 2021, 3:07am

Hi. I have an input .txt file of size 2.7GB and a simple piece of code to process it, shown below. My goal is to create two different files from this input file, named inputSeqs.fasta and inputMeta.tsv. The problem is that it takes a rather long time to run.

with open("./ena_embl-covid19_20210107-0806.txt") as f:
    line = f.readline()
    while line != "":
        if line[:2] == "ID":
            strainNameIDLst.append(line[2:].strip()[:8])
        elif line[:2] == "DT":
            samplingDateLst.append(line[2:].strip()[:11])
        elif line[:2] == "SQ":
            lineLst = line.split(" ")
            tmpLst = []
            for item in lineLst:
                if item.isdigit():
                    tmpLst.append(int(item))
            seqLengthLst.append(tmpLst[0])
            aCountsLst.append(tmpLst[1])
            cCountsLst.append(tmpLst[2])
            gCountsLst.append(tmpLst[3])
            tCountsLst.append(tmpLst[4])
            crrtSeq = ""
            line = f.readline().strip()
            while line != "//": # need to test this point here
                crrtSeq += line
                line = f.readline().strip()
            seqsLst.append(crrtSeq)
        elif line[:2] == "DE":
            line = f.readline()
            countryLst.append(line[2:].strip().split("/")[2])
            hostLst.append(line[2:].strip().split("/")[1])

I was wondering, is there a better way to handle it? Many thanks.

bsn · January 8, 2021, 3:47am

After running the above code, the process finally returned with the following message.

Process returned -9 (0x-9) execution time : 1747.012 s
Close this window to continue…
(base) Lucys-MacBook-Pro:~ lucy$

So, what does this mean?

jeanas · January 8, 2021, 4:13pm

It is normal for the processing of such a large file to take time. Without complete code, it’s hard to tell how to do better (and the indentation is mangled, not sure why).

You might want to try running your code with PyPy or numba.

steven.daprano · January 9, 2021, 1:31am

Hi bsn,

(By the way, is there a name you would rather be called instead
of “bsn”?)

You say “it takes a rather long time to run”, but we don’t know what you
consider a long time to process a 2.7GB file. A minute? Five minutes? An
hour? A day?

I expect that processing time here is dominated by the speed of your
file I/O. If you are reading a file on a network drive, the speed you
can read it is limited by the slower of the network speed and the
storage speed at the back end.

Worst case: you are reading from a really slow DVD-ROM drive on a
network drive over slow and unreliable wireless. Best case, a really
fast SSD on the local machine.

If you have the option, move the file to the fastest local drive on the
fastest computer you have, and run the code on that computer.

As far as your code goes, it mostly looks reasonable to me, except for
one part that could be excessively slow. This part:

crrtSeq = ""
line = f.readline().strip()
while line !="//":       # need to test this point here
    crrtSeq += line
    line = f.readline().strip()
seqsLst.append(crrtSeq)

Repeated string concatenation like that can sometimes be very slow. It’s
hard to tell whether this is a problem for you, but just in case, I
would re-write it to this:

crrtSeq = []
line = f.readline().strip()
while line !="//":       # need to test this point here
    crrtSeq.append(line)
    line = f.readline().strip()
seqsLst.append(''.join(crrtSeq))

This may or may not make any difference, but it shouldn’t be slower and
it might be faster.
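
To check whether the concatenation matters on a given machine, the two approaches can be timed side by side. The following is only a sketch: the function names and the sample data (500 lines of 60 characters) are made up for illustration, and the result will vary by interpreter. CPython can often optimise `+=` on strings in place, so the gap may turn out to be small.

```python
import timeit

def concat_plus(lines):
    # Build the sequence by repeated string concatenation.
    seq = ""
    for line in lines:
        seq += line
    return seq

def concat_join(lines):
    # Collect the pieces in a list and join them once at the end.
    parts = []
    for line in lines:
        parts.append(line)
    return "".join(parts)

# Hypothetical data: 500 lines of 60 characters each.
lines = ["acgt" * 15] * 500

# Both approaches must produce the same string.
assert concat_plus(lines) == concat_join(lines)

for func in (concat_plus, concat_join):
    seconds = timeit.timeit(lambda: func(lines), number=200)
    print(f"{func.__name__}: {seconds:.3f} s")
```

If the two timings are close, the concatenation is probably not the bottleneck and the time is going somewhere else.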

bsn · January 11, 2021, 2:56am

Thank you for your kind reply. You may call me Lucy, by the way.

steven.daprano · January 11, 2021, 9:10am

Hi Lucy,

Did you try the changes I suggested, and did they make a difference?

Regards,

Steve

bsn · January 12, 2021, 9:22am

Hi, I tried what you suggested and it worked well. My file is 2.8GB, and it ran for about 28.95 minutes. That is still not very impressive.

steven.daprano · January 12, 2021, 9:51am

Can you run this simple script over your file? It will tell you
roughly how much of those 30 minutes is spent reading the
file.

import time
filename = "<path to your file goes here>"
t = time.time()
with open(filename, 'r') as f:
    for count, line in enumerate(f, 1):
        pass
t = time.time() - t
print("Lines:", count)
print("Minutes:", t/60)

Short of moving the file to a faster disk, that’s pretty much the
minimum time it takes to read the file. There’s not really anything you
can do to speed that up except to get a faster disk or a smaller file.

Depending on the difference between that time and the 30 minutes you are
getting, it might be worth spending some more effort on improving your
code.
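
If the gap turns out to be large, a profiler can show which calls in the processing code account for it. The following is a minimal sketch using the standard library's cProfile and pstats modules. Both `process_file` and `profile_call` are hypothetical names introduced here, and `process_file` is only a stand-in for the real parsing loop.

```python
import cProfile
import io
import pstats

def process_file(path):
    # Hypothetical stand-in for the real parsing loop:
    # counts how many lines start with each two-letter tag.
    counts = {}
    with open(path) as f:
        for line in f:
            key = line[:2]
            counts[key] = counts.get(key, 0) + 1
    return counts

def profile_call(func, *args):
    # Run func(*args) under cProfile and return its result together
    # with a text report of the ten most expensive entries,
    # sorted by cumulative time.
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()
```

Calling `profile_call(process_file, filename)` and printing the second return value gives the report. Profiling adds overhead, so it may be quicker to run it on the first few hundred thousand lines of the file rather than on all of it.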

bsn · January 12, 2021, 9:56am

Lines: 39879168
Minutes: 0.11047803163528443

[Finished in 6.677s]

steven.daprano · January 12, 2021, 10:22am

Well, that was certainly faster than I expected. Six seconds to read a 3
GB file. Are you using an SSD?

I’ll have another look at your file processing code when I get a chance.
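
One direction the processing code could take, given that reading alone is this fast, is to write each record out as soon as it is complete instead of collecting everything in lists first. The following is only a sketch under several assumptions: `iter_records` and `convert` are hypothetical names, only the ID, DT and SQ lines from the original code are handled, the first DT line of each record is the one kept, and sequence lines are joined as they are, as in the original code. The TSV columns are likewise made up for illustration.

```python
import csv

def iter_records(lines):
    # Yield one (strain_id, date, sequence) tuple per record.
    # `lines` can be an open file or any iterable of strings.
    lines = iter(lines)
    strain_id = date = None
    for line in lines:
        tag = line[:2]
        if tag == "ID":
            strain_id = line[2:].strip()[:8]
        elif tag == "DT" and date is None:
            # Assumption: keep only the first DT line of a record.
            date = line[2:].strip()[:11]
        elif tag == "SQ":
            # Consume sequence lines from the same iterator up to "//".
            parts = []
            for seq_line in lines:
                seq_line = seq_line.strip()
                if seq_line == "//":
                    break
                parts.append(seq_line)
            yield strain_id, date, "".join(parts)
            strain_id = date = None

def convert(in_path, fasta_path, tsv_path):
    # Stream records from in_path straight into the two output files,
    # so only one record is held in memory at a time.
    with open(in_path) as src, \
         open(fasta_path, "w") as fasta, \
         open(tsv_path, "w", newline="") as tsv:
        writer = csv.writer(tsv, delimiter="\t", lineterminator="\n")
        writer.writerow(["strain", "date", "length"])
        for strain_id, date, seq in iter_records(src):
            fasta.write(f">{strain_id}\n{seq}\n")
            writer.writerow([strain_id, date, len(seq)])
```

A call such as `convert("input.txt", "inputSeqs.fasta", "inputMeta.tsv")` would then produce both files in a single pass. The country, host and base-count fields from the original code are left out here and would need to be added in the same way.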