Question: De Novo Metatranscriptomic Assembly Failing - Trinity, Velvet/Oases
Newvin wrote, 8.4 years ago:

I'm attempting de novo assembly of metatranscriptomic data, which is admittedly a very resource-intensive problem. I have ~206 million paired-end Illumina reads, each 100 bp long, generated via RNA-seq on environmental samples. I can create assemblies with Trinity and Velvet/Oases using a small portion of the reads; however, when I attempt to assemble the metatranscriptome from the full set of reads, both programs run for a day or so and then fail while trying to allocate memory. The server I am running on has 32 processors and 256 GB of RAM. I should also mention that for Velvet/Oases I am using K=61; I believe Trinity's K value is fixed at 25.

I am rather new at this. Does anyone have a sense of how unreasonable my parameters are? Is the idea of assembling 200 million reads ludicrous? I may be able to perform a dereplication step that would reduce the number of reads to ~50 million. Does anyone have assembly experience suggesting I might have more success with only 50 million reads?


Jeremy Leipzig wrote, 8.4 years ago:

You might need to speak to Titus Brown, who has used Bloom filters to put metagenomic (perhaps not metatranscriptomic) reads into manageable piles.

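For background, a Bloom filter stores set membership (here, k-mer presence) in a fixed-size bit array at the cost of occasional false positives, which is what makes it attractive at this read count. A minimal toy sketch (the class and function names are made up for illustration; khmer's real data structures are far more efficient):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter for k-mer membership (illustration only)."""

    def __init__(self, size_bits=1_000_000, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)  # fixed-size bit array

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(item))

def kmers(seq, k=21):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

bf = BloomFilter()
for km in kmers("ACGTACGTACGTACGTACGTACGT"):
    bf.add(km)
print("ACGTACGTACGTACGTACGTA" in bf)  # True: this 21-mer was inserted
```

The bit array never grows no matter how many k-mers are inserted; a false positive just means a k-mer may occasionally be reported present when it is not, which graph-partitioning approaches can tolerate.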

I'd be interested in your results with digital normalization; I think it might work better for metatranscriptomic data than partitioning will.

written 8.0 years ago by Titus Brown
pmenzel wrote, 8.4 years ago:

Assembling that many reads is not unreasonable. Try SOAPdenovo for the assembly. If you filter out low-abundance k-mers (e.g. with the -d option of SOAPdenovo), memory consumption will decrease.

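The low-abundance filtering described above can be sketched as follows (a toy illustration of the idea, not SOAPdenovo's actual implementation; the function name and tiny k are made up for the example):

```python
from collections import Counter

def solid_kmers(reads, k=5, min_count=2):
    """Count k-mers across all reads and keep only the 'solid' ones seen
    more than min_count times; everything else is treated as likely
    sequencing error and dropped before graph construction."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {km: c for km, c in counts.items() if c > min_count}

# k-mers from the repeated read survive; the singleton's are discarded.
reads = ["ACGTACGTAC"] * 3 + ["TTGGCCAATT"]
solid = solid_kmers(reads, k=5, min_count=2)
print(len(solid))  # 4 distinct solid k-mers, all from the repeated read
```

Error k-mers are mostly singletons, so cutting them drastically shrinks the de Bruijn graph an assembler has to hold in RAM.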
Dgg32 wrote, 6.4 years ago:

I would cluster the reads with CD-HIT using a high identity cutoff and record the read counts in the FASTA headers so I can keep track of them. This step alone cut my sequences in half without losing a single read (though it surely masks some of the heterogeneity of your sequences). Then Velvet with default settings will finish the job.

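The count-in-header bookkeeping described above can be sketched like this (exact duplicates only, with a hypothetical `size=` header convention; CD-HIT additionally merges near-identical reads at a chosen identity cutoff):

```python
from collections import Counter

def dereplicate(seqs):
    """Collapse identical reads into one record each and store the copy
    number in the FASTA header so abundance information survives
    dereplication."""
    counts = Counter(seqs)
    return [(f">seq{i};size={n}", seq)
            for i, (seq, n) in enumerate(counts.most_common(), 1)]

reads = ["ACGT", "ACGT", "ACGT", "TTGG", "TTGG", "CCAA"]
for header, seq in dereplicate(reads):
    print(header, seq)
# >seq1;size=3 ACGT
# >seq2;size=2 TTGG
# >seq3;size=1 CCAA
```

The assembler then sees each distinct sequence once, while the size annotations let you recover per-contig abundance afterwards.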


Powered by Biostar version 2.3.0