Hi guys, I have 32GB of RAM in my PC and I am trying to count all the aacs (amino acids) in a 67GB compressed FASTA file. Of course I tried all the tools I know (biopython, generators, etc.) and I still get inefficient results: my PC almost stops, and I don't have access to any kind of cloud environment. I am working on this task independently. I haven't tried any Linux tools because I am trying to stick with Python. I was trying to build one long string from the file and then count the aacs.
```python
import gzip
from collections import Counter
from Bio import SeqIO

def element_count(filename):
    """Yields a Counter with the counts of the aacs that compose the
    sequences in a gzipped FASTA file: aacs as keys, counts as values."""
    with gzip.open(filename, "rt") as handle:
        aac_count = Counter()
        # NB: this joins every sequence into one giant string before counting
        seq = ''.join(str(rec.seq) for rec in SeqIO.parse(handle, "fasta"))
        aac_count.update(seq)
        yield aac_count
```
Is there any way to improve this, or any free tool I can use to make the task easier? Anyway, I am very thankful for your support and help. I always learn a lot here and on Stack Overflow.
You are almost there:
OK I will try that. Thanks massa.
Do you have to operate on the whole compressed multifasta at once? It would be slower, but you could decompress the fasta first, split it into chunks or individual records (you may even be able to split the archive) and just update a dictionary held in memory iteratively. This would help you get around your memory issue.
Joe, I will try that. I will work on that code. Thanks
I think that the modified code does just that - iteratively (sequence-wise) updates the dict (`Counter`) object, without the need to hold the decompressed data on disk. But I could be wrong.
I think it should, yeah, but I'm not sure how the implementation of `gzip` will play with it. `parse` is the way to go, and use the iterator, but having the compressed file `gzip.open`'d might change things. I don't do much with compressed formats in Python so I'm not certain!
Are you certain that `parse` is efficient enough? I would imagine that parsing the data into records is quite some overhead.
A `SeqRecord` object is not all that huge in the grand scheme of things, and as it's an iterator, there should only ever be one instance alive at any one time.
That said, bio(python) is neither fast nor particularly frugal with memory, so a more robust solution might be required, depending on how fast schlogl needs it to be.
Well, I read some stuff about Biopython and the only thing I haven't used yet is SimpleFastaParser. I really don't know if it would speed things up, but I am sure I tried all the basic stuff they discuss in their cookbook to make it efficient, including using gzip.open to avoid decompressing the file. What I read was that the SeqIO parser is slow because it takes care of a lot of things, which is why they suggest SimpleFastaParser.
I will check it out and then come back here with some information.
See you soon.
"""The “catch” is that you have to work with SeqRecord objects (see Chapter 4), which contain a Seq object (see Chapter 3) plus annotation like an identifier and description. Note that when dealing with very large FASTA or FASTQ files, the overhead of working with all these objects can make scripts too slow. In this case consider the low-level SimpleFastaParser and FastqGeneralIterator parsers which return just a tuple of strings for each record (see Section 5.6). http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:low-level-fasta-fastq """ It took 2-3 hr to get the job done (count the aac). I will try the simplefastaparser tosee ifit couldimprove the script.
Thank you massa and Joe for the help and interest.
This is exactly what I meant (even if one record is not a lot by itself, it is the quantity that kills the cat).
My point wasn't about speed. OP didn't suggest that the process was too slow, rather that it was failing to finish altogether. My point was about an approach to minimise the memory usage.
I'm going to book a course on reading comprehension now... ><;
Now I am trying this:
Now I can use the PC to read, search the net, even download, but the time to finish the job is loooooong... 8)

Hmm... maybe if I change the list to a generator? Would that be better???
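A generator would at least avoid holding every kmer in memory at once; whether it helps the runtime is another matter. A minimal sketch of the idea (the window size `k` and the example sequences are assumptions, since the code in question isn't shown):

```python
from collections import Counter

def kmers(seq, k):
    """Yield the kmers of seq one at a time instead of building a list."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

# stand-ins for sequences parsed from the FASTA
counts = Counter()
for seq in ("MKVL", "KVLA"):
    counts.update(kmers(seq, 3))
```

`Counter.update` consumes the generator lazily, so only one kmer exists at a time.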
You're asking about kmers now, but originally you just cared about counting residues. Enumerating kmers is considerably slower than the task you first described, so can you clarify what the actual objective is? It changes what code is needed.
I am trying to use this approach in a comprehensive study of a protein database. Each task will be a part of the global job, and I will use different code for each part of the study: counting aacs, kmers, etc...
You're trying to solve/ask too many things at once.
At the moment your code cannot even finish, and I suspect this has something to do with your fasta being larger than the RAM you have available, not the processing speed.
Once that is solved, approaches to tackle the runtime can be considered, but there's no universe where this will be quick, particularly in python.
I would advise taking a small subset of your data, even if it's a large chunk, say 10GB worth, and ensure the code is even functional/capable of completing at all first.
Hi Joe. I have got the counting done. The next step is to enumerate the motifs, and I know for sure it's an inefficient task. I will try some different tools, like Jellyfish... Thank you for your help and tips. Hope I can count on you to help me with other problems along the way.
Jellyfish would be a good call.
If you want to try python, I have the following code which can figure out a given kmer set for all human data in about 20 minutes. It's reasonably fast per-record. Just bear in mind that the human genome is still only a ~3GB file, so assuming it could scale linearly, you're still looking at
20 min * (67GB / 3GB) ≈ 447 min ≈ 7.5 hrs.
These numbers will change radically depending on the size of `k` and somewhat on `n`. For your use case, you won't be interested in `n` random kmers, but the `kseqs` variable should contain all kmers, which you can hand to `Counter()` to enumerate them.
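The linked code and its regex aren't reproduced in this copy of the thread, but that last step can be sketched as follows; the variable names follow the description above, and `seq` is an illustrative stand-in for one parsed record:

```python
from collections import Counter

k = 3
seq = "MKVLAKKV"  # stand-in for one sequence parsed from the FASTA

# kseqs: every kmer of length k in the sequence
kseqs = [seq[i:i + k] for i in range(len(seq) - k + 1)]

# hand the list to Counter() to enumerate (tally) the kmers
kmer_counts = Counter(kseqs)
```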
Full code and output here.
Hey Joe, thanks again for the help. I don't mind if a job takes a long time to finish, as long as it can run on my PC without bringing it to a halt. Like I said, with the change in the code I got the PC running the aac counting and could still use it to read PDFs, search the net, etc... It didn't freeze at all.
I will check out the code you designed and also look into Jellyfish... Thank you for your time; I really appreciate your tips and help. You always come up with something nice. Thanks again.
PS - I will try this https://biopython.org/wiki/Split_large_file + gzip and then your code to find all kmers in the files. Ah! Could you explain that regex for me?