How can I count aminoacid residues from a HUGE compressed fasta file?
0
0
Entering edit mode
3.2 years ago
schlogl ▴ 130

Hi guys I have 32GB of memory ram in my pc and I am trying to count all the aac from a compressed fasta file of 67GB. Of course I tried to use all the tools I know, biopython, generators, etc and I have still inefficient results, my pc almost stops and I don't have access to any type of cloud environment. I am working independently in this task. I didn't try any linux stuff because I am trying to use python. I was trying to make a long string from the file and then count the aacs.

def element_count(filename):
"""Returns a dicitonary Counter with the counting if the aac that
compose the sequences from a gzip fasta file. The aacs as keys and
counts as values."""
with gzip.open(filename, "rt") as handle:
aac_count = Counter()
seq = ''.join(str(rec.seq) for rec in SeqIO.parse(handle, "fasta"))
aac_count.update(seq)
yield aac_count


There are any way to improve this or any available free tool I can use to facilitate my task? Any way I am very thankful for your support and help. I always learn a lot here and in the stackoverflow.

Thank you

Paulo

sequence • 1.1k views
1
Entering edit mode

You are almost there:

def element_count(filename):
"""Returns a dicitonary Counter with the counting if the aac that
compose the sequences from a gzip fasta file. The aacs as keys and
counts as values."""
with gzip.open(filename, "rt") as handle:
aac_count = Counter()
for rec in SeqIO.parse(handle, "fasta"):
aac_count.update(str(rec.seq))

return aac_count

0
Entering edit mode

OK I will try that. Thanks massa.

1
Entering edit mode

Do you have to operate on the whole compressed multifasta at once? It would be slower, but you could decompress the fasta first, split it in to chunks or individual records (you may even be able to split the archive) and just update a dictionary held in memory iteratively. This would help you get around your memory issue.

0
Entering edit mode

Joe I will try that. I will work in that code. Thanks

0
Entering edit mode

I think that the modified code does just that - iteratively (sequence-wise) update the dict (Counter) object. Without the need to hold the decompressed data on the disc. But I could be wrong.

0
Entering edit mode

I think it should yeah, but I'm not sure how the implementation of gzip will play with SeqIO.parse. Certainly parse is the way to go, and use the iterator, but having the compressed file gzip.open'd might change things. I don't do much with compressed formats in python so I'm not certain!

0
Entering edit mode

Are you certain that parse is efficient enough? I would imagine that parsing the data into records is quite some overhead.

0
Entering edit mode

A SeqRecord object is not all that huge in the grand scheme of things, and as its an iterator, there should only ever be one instance happening at any one time.

That said, bio(python) is neither fast, nor particularly frugal with memory, so a more robust solution might be required, depending on how fast schlogl needs it to be.

0
Entering edit mode

Well I read some stuff about biopython and the only thing I didn't use yet was the SimpleFastaParser. I really don't know if it would speed up thinks. But I am sure I tried all the basic stuff they discuss in their cookbook to make it be efficient.

Including using gzip.open to avoid to decompress the file. What I read was that SeqIO parser is slow because it takes care of a lot of things, because this they suggest SimpleFastaParser.

I will check it out and then come back here with some information.

See you soon.

0
Entering edit mode

"""The “catch” is that you have to work with SeqRecord objects (see Chapter 4), which contain a Seq object (see Chapter 3) plus annotation like an identifier and description. Note that when dealing with very large FASTA or FASTQ files, the overhead of working with all these objects can make scripts too slow. In this case consider the low-level SimpleFastaParser and FastqGeneralIterator parsers which return just a tuple of strings for each record (see Section 5.6). http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:low-level-fasta-fastq """ It took 2-3 hr to get the job done (count the aac). I will try the simplefastaparser tosee ifit couldimprove the script.

Thank you massa and Joe for the help and interest.

I

0
Entering edit mode

which contain a Seq object (see Chapter 3) plus annotation like an identifier and description. Note that when dealing with very large FASTA or FASTQ files, the overhead of working with all these objects can make scripts too slow.

This is exactly what I meant (even if one record is not a lot by itself, it is the quantity that kills the cat).

0
Entering edit mode

My point wasn't about speed. OP didn't suggest that the process was too slow, rather that it was failing to finish altogether. My point was about an approach to minimise the memory usage.

0
Entering edit mode

I'm going to book a course on reading comprehension now... ><;

0
Entering edit mode

Now I am trying this:

def get_all_non_overlap_sequence_kmers(sequence, k):
return [sequence[i:i+k] for i in range(0, len(sequence) - k + 1, k)]

with gzip.open("data/uniparc_active.fasta.gz", 'rt') as in_handle:
counter_kmers = Counter()
for name, seq in SimpleFastaParser(in_handle):
counter_kmers.update(get_all_non_overlap_sequence_kmers(seq, 5))


Now I can use the pc for read, search in the net, even download, but the time for finish the job is loooooong...8)

Humm...mybe if I change the list for a generator? would be better???

Paulo

2
Entering edit mode

You're asking about kmers now, but originally just cared about counting residues. Enumerating kmers is considerably slower than the task your first described, so can you clarify what the actual objective is, because this is confusing the code needed now,

0
Entering edit mode

I am tryig to using this approach in an abrangent study with a database protein data. Each one will be a part of the global job. I will use different codes for each partof the study. count aac, kmers, etc...

0
Entering edit mode

You're trying to solve/ask too many things at once.

At the moment your code cannot even finish, and I suspect this has something to do with your fasta being larger than the RAM you have available, not the processing speed.

Once that is solved, approaches to tackle the runtime can be considered, but there's no universe where this will be quick, particularly in python.

I would advise taking a small subset of your data, even if it's a large chunk, say 10GB worth, and ensure the code is even functional/capable of completing at all first.

Premature optimisation is the root of all evil

0
Entering edit mode

Hi Joe. I have got the counting done. And now the next step is to enumerate the motifs and for sure I know its a ineficient task. I will try to use some different tools, like jellyfish... Thank you for your help and tips. Hope i can count on you to help me with other problems on the way.

0
Entering edit mode

Jellyfish would be a good call.

If you want to try python, I have the following code which can figure out a given kmer set for all human data in about 20 minutes. Its reasonably fast per-record. Just bear in mind that the human genome is still only a ~3GB file, so assuming it could scale linearly, you're still looking at 20 min * (67GB/3GB) = 446 min ~= 7.5 hrs.

These numbers will change radically depending on the size of k and somewhat on n. For your use case, you won't be interested in n random kmers, but the kseqs variable should contain all kmers, which you can hand to Counter() to enumerate them.

import re, sys
from Bio import SeqIO
import random
import time

k = 9
n = 10
kmers = re.compile("(?=(\w{%s}))" % k)

def timer(start, end):
hours, rem = divmod(end-start, 3600)
minutes, seconds = divmod(rem, 60)
return hours, minutes, seconds

start = time.time()

for rec in SeqIO.parse(sys.argv[1], 'fasta'):
s = time.time()
kseqs = kmers.findall(str(rec.seq))
randomsel = random.sample(kseqs, n)
e = time.time()
h, m, s = timer(s, e)
print("10 random kmers from {} [Elapsed time: {:0>2}:{:0>2}:{:05.2f}]\n{}".formatrec.id, int(h), int(m), s, str(randomsel)))

end = time.time()
hours, minutes, seconds = timer(start, end)
print("Total elapsed time: {:0>2}:{:0>2}:{:05.2f}".format(int(hours),int(minutes),seconds))


Full code and output here.

0
Entering edit mode

Hey Joe thanks again for the help. I don't mind it takes a long time to finish a job, since it could be done in my pc without stop it. Like I said with the change in the code I got the pc running the aac counting and still could use it to read pdfs, search in the net, etc...It nor frozen et all.

I will check it out the code you designed and also look for jelly... Thank you for your time, and I really appreciate you tips and help. You always came up with something nice. Thanks again.

PS - I will try this https://biopython.org/wiki/Split_large_file + gzip and then your code to find all kmers in the files. Ah! could you explain for me that regex?

Thanks