Question: Memory requirements of Velvet (de novo assembly)
wangyang703092 wrote:
Hi folks, I'm using Velvet for genome assembly but have run into a problem. The species has an estimated genome size of ~200 Mb, and I have 4 lanes of data, each with 35 million 76 bp paired-end reads (~100x coverage in total). When I run Velvet with k=31 and default parameters, the 64 GB server hits nearly 100% RAM usage and becomes almost unreachable over SSH. So here is my question: how much RAM does a server need to run Velvet or SOAPdenovo2 on this dataset?
Tags: de novo assembly
rtliu wrote:

Plugging a single lane of your data into the Velvet memory calculator (35 million pairs = 70 million 76 bp reads, 200 Mb genome, k=31) gives:

Memory: ~32 GB
Coverage: 26.6x
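
That calculator implements Simon Gladman's empirical formula for velvetg's peak RAM, as I recall it. A minimal shell sketch under that assumption, using your one-lane numbers (treat the result as a rough lower bound, not a guarantee):

```bash
# Velvet memory estimate (Simon Gladman's empirical formula; result in kB):
#   RAM_kB = -109635 + 18977*readlen + 86326*genome_Mb + 233353*reads_M - 51092*k
awk -v rs=76 -v gs=200 -v nr=70 -v k=31 'BEGIN {
    kb = -109635 + 18977*rs + 86326*gs + 233353*nr - 51092*k
    printf "Estimated velvetg RAM: %.1f GB\n", kb / 1024 / 1024
}'
```

With all four lanes (nr=280), the same formula predicts roughly 79 GB, which is consistent with your 64 GB server being exhausted.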

As Brian said, sequencing errors, adapter contamination, a heterozygous genome, etc. will all increase the memory requirement.

SOAPdenovo2 will need a lot less RAM than Velvet; the most RAM-efficient way to run it is the sparse_pregraph command, which constructs a sparse k-mer graph.
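
For illustration, that first step might look like the sketch below. The flags are from the SOAPdenovo2 README as I remember them, and config.txt, the genome-size estimate, and the output prefix are placeholders, so verify against your build's help output:

```bash
# Build a sparse k-mer graph (uses far less RAM than the standard pregraph step)
# -s: library config file  -K: kmer size  -z: estimated genome size in bp  -p: threads
SOAPdenovo-63mer sparse_pregraph -s config.txt -K 31 -z 200000000 -p 8 -o sparse_k31

# Continue with the usual steps on the same prefix:
SOAPdenovo-63mer contig -g sparse_k31
SOAPdenovo-63mer map -s config.txt -g sparse_k31
SOAPdenovo-63mer scaff -g sparse_k31
```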

Brian Bushnell wrote:

The memory requirement tends to increase with the number of unique kmers.  So, the more data, the bigger the genome, and the higher the error rate, the more memory will be needed.

Thus, you can reduce the memory requirements (and often get a better result) by quality-trimming or filtering, contaminant removal (both synthetic and natural, such as human contamination), and adapter-trimming.  After that, you can further decrease memory requirements by error-correction, and by subsampling or normalizing the input data to a much lower level.  And, ultimately, you will probably get a much better assembly with a kmer longer than 31; perhaps around 41-49 with high coverage 76bp reads.  Sometimes it's also useful to split out the ribosomal, mitochondrial, and chloroplast parts of the genome (which may have a much higher coverage than the rest) and assemble them separately; this is often possible by depth-binning.
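
For instance, a longer-kmer Velvet run on the cleaned reads might look like this sketch (directory name, file names, and insert size are placeholders; -separate needs Velvet 1.2+, and older versions want interleaved input instead):

```bash
# Hash the reads at k=45 (Velvet must be compiled with MAXKMERLENGTH >= 45)
velveth asm_k45 45 -shortPaired -fastq -separate reads_R1.fq reads_R2.fq

# Assemble, letting Velvet estimate expected coverage and the coverage cutoff
velvetg asm_k45 -exp_cov auto -cov_cutoff auto -ins_length 300
```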

Sometimes you can see contamination peaks in the insert-size histogram (for synthetic contaminants) or gc histogram (for genomic contaminants).  BLASTing a few thousand reads against nt can often tell you which contaminants may be present.  If your reads are overlapping, you can generate an insert-size histogram with BBMerge and look for very sharp peaks, which are typically synthetic contaminants.
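
If your reads do overlap, generating that histogram is a single command (file names are placeholders; ihist= is BBMerge's insert-size histogram output):

```bash
# Merge overlapping pairs and write an insert-size histogram.
# Very sharp spikes at short insert sizes usually indicate synthetic contaminants.
bbmerge.sh in1=reads_R1.fq in2=reads_R2.fq ihist=ihist.txt
```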

You can do quality trimming, filtering, contaminant removal, adapter trimming, subsampling, and gc histogram generation with BBDuk.  For human removal (or other genomic contamination from large genomes with references) I suggest BBMap instead as it has higher specificity.  After trimming and contaminant removal, you can do error-correction and normalization with BBNorm to reduce coverage and selectively concentrate real genomic kmers; or, subsample.  If you normalize, a target depth of 30x to maybe 60x is probably optimal for Velvet though it depends on the kmer size you use for assembly (bigger kmer needs more coverage) and whether the genome is diploid.
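
A sketch of that pipeline, under a few assumptions: adapters.fa, human.fa, the read-file names, and the 40x target are placeholders, and the BBMap settings are the high-specificity ones typically suggested for human removal, so check each tool's built-in help before running:

```bash
# 1) Adapter-trim (right end, k=23 down to 11 at read tips, 1 mismatch allowed),
#    then quality-trim both ends to Q10 and drop reads shorter than 40 bp
bbduk.sh in1=reads_R1.fq in2=reads_R2.fq out1=trim_R1.fq out2=trim_R2.fq \
    ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 qtrim=rl trimq=10 minlen=40

# 2) Map against the human reference with strict settings; keep only
#    the unmapped (outu) reads as the decontaminated set
bbmap.sh ref=human.fa in=trim_R1.fq in2=trim_R2.fq \
    outu=clean_R1.fq outu2=clean_R2.fq \
    minid=0.95 maxindel=3 minhits=2 quickmatch fast

# 3) Error-correct and normalize to ~40x depth with BBNorm
bbnorm.sh in=clean_R1.fq in2=clean_R2.fq out=norm_R1.fq out2=norm_R2.fq \
    target=40 min=3 ecc=t prefilter
```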

These are all part of BBTools, and each has a shellscript (bbduk.sh, bbmap.sh, and bbnorm.sh) which will display usage information.
