Question: Does a high-quality de novo transcriptome assembly rely on merging multiple assemblies?
5.3 years ago by
seta1.3k wrote:

Dear all,

Please share your experience with combining multiple assemblies (derived from different k-mer sizes or different programs) to produce the best de novo transcriptome assembly and, subsequently, a high-quality annotation. I have run de novo assemblies with CLC using several k-mer sizes on about 400 million Illumina reads (100 bp PE), giving 10 assemblies. I am going to try Trinity too, and finally combine these assemblies into a single, highly informative one for a non-model organism that has little information in public databases. It would also be great if you could mention the tool you consider best for combining assemblies. Any suggestions and comments would be highly appreciated.

rna-seq next-gen assembly • 3.8k views
ADD COMMENTlink modified 5.3 years ago by dago2.6k • written 5.3 years ago by seta1.3k

Hi seta, thank you for your question on transcriptome assembly tools and pipelines. May I suggest that you add some more information about the actual data and experimental settings in order to add some 'flesh' to your question?

In bioinformatics, or science in general, there is often no "best", "optimal", or "perfect" tool for a (weakly defined) task. Instead, the optimal tools depend on many factors, including the data, the experimental question, computational complexity and constraints, and the like. In my understanding, asking for the best tool without context is therefore pointless and unscientific; it might lead to flame wars and subjective discussions, and quickly go out of focus.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Michael Dondrup47k

Thanks, Michael, for correcting me! I added some information. That's right, there is no single best or perfect tool, but someone may find that one program or tool is much more efficient than another.

ADD REPLYlink written 5.3 years ago by seta1.3k
5.3 years ago by
New Zealand
rtliu2.1k wrote:

See similar post: merged transcripts from RNA de novo assembly to create a reference transcriptome

Corset uses both sequence similarity and expression data to cluster contigs, which is why Corset does a better job than CD-HIT-EST.

ADD COMMENTlink written 5.3 years ago by rtliu2.1k
5.3 years ago by
Vivek Todur50 wrote:

Hi Seta, CD-HIT-EST clustering is the best in class for merging multiple transcriptome assemblies. It simply keeps the longer sequences and removes partial, subset, or shorter sequences. There are a lot of parameters to play with; I would recommend using only -s (the shorter sequences need to be at least XX% of the length of the cluster representative). Hope this helps.


ADD COMMENTlink written 5.3 years ago by Vivek Todur50
5.3 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

I wrote a tool, Dedupe, for merging multiple assemblies and removing redundant contigs. It's designed specifically for this purpose, with various options for controlling which sequences are considered duplicates based on contig overlap length, number of substitutions, edits, and so forth. The basic usage is like this:

dedupe.sh in=assembly1.fa,assembly2.fa,assembly3.fa out=merged.fa

...which will just eliminate all exact duplicates and fully contained subsequences. You can get complete usage information by running with no arguments.
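To make the idea of "exact duplicates and fully contained subsequences" concrete, here is a minimal Python sketch. It is only an illustration of the concept and not how Dedupe is implemented; the function names, the `remove_contained` switch, and the choice to treat reverse complements as duplicates are all assumptions of this example.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    table = str.maketrans("ACGTN", "TGCAN")
    return seq.translate(table)[::-1]


def remove_redundant(contigs: dict, remove_contained: bool = True) -> dict:
    """Drop contigs that duplicate, or are contained in, a longer kept contig.

    contigs: mapping of contig name -> sequence (hypothetical input format).
    remove_contained: if False, only exact duplicates are removed.
    Naive O(n^2) comparison; fine for a toy example, far too slow for real data.
    """
    # Process longest first so a shorter contig is always compared
    # against the longer contigs that were already kept.
    ordered = sorted(contigs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    kept = {}
    for name, seq in ordered:
        fwd = seq.upper()
        rev = reverse_complement(fwd)
        redundant = False
        for kept_seq in kept.values():
            if remove_contained:
                # Substring test also catches exact duplicates.
                match = fwd in kept_seq or rev in kept_seq
            else:
                match = fwd == kept_seq or rev == kept_seq
            if match:
                redundant = True
                break
        if not redundant:
            kept[name] = fwd
    return kept


if __name__ == "__main__":
    toy = {
        "a1": "ACGTACGTGG",  # kept
        "a2": "ACGTACGTGG",  # exact duplicate of a1
        "b1": "GTACG",       # fully contained in a1
        "c1": "CCACGTACGT",  # reverse complement of a1
        "d1": "TTTTTAAAC",   # unrelated, kept
    }
    print(sorted(remove_redundant(toy)))
    print(sorted(remove_redundant(toy, remove_contained=False)))
```

Note that two contigs that overlap only partially, or that differ by a single base, both survive this kind of filter, so merging several assemblies can still leave near-duplicates in the result.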

ADD COMMENTlink modified 7 months ago by RamRS27k • written 5.3 years ago by Brian Bushnell17k

Thanks for all the comments. Dear Brian, could you please share a paper or document that explains the Dedupe tool in detail? It looks great for my purpose, but it would be good to know how it was tested. Honestly, I found EvidentialGene's tr2aacds for merging transcriptome assemblies, along with some solid papers that used it. However, one user mentioned that some metrics, such as N50, the CEGMA analysis results, and the percentage of reads mapping back to the transcripts, were significantly worse for the tr2aacds output than for the individual assemblies. So I have some doubts about merging assemblies. I would highly appreciate hearing from users about whether merging several assemblies is a good idea or not; please let us know your findings in detail. Thanks a lot for your consideration.
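For readers comparing a merged assembly against the individual ones, N50 is easy to compute from contig lengths alone. The sketch below uses the common definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length); the function name `n50` is just a label for this example, and the lengths in the usage part are made-up numbers.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 here is the length L such that contigs of length >= L
    together cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    single = [1000, 800, 600]        # made-up single assembly
    merged = single + [300] * 6      # same contigs plus short extra ones
    print(n50(single), n50(merged))
```

Because short, partially redundant contigs inflate the total length, a merged assembly can show a lower N50 than its inputs even when no real sequence was lost, so N50 is best read alongside the read mapping rate and a completeness check.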

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by seta1.3k

Dedupe has been used in production on all metagenomic assemblies at JGI for about a year.  If you want to merge multiple assemblies of the same data, I highly recommend it; it will only remove redundant contigs, not do any assembly and not try to combine different contigs.  So, if a read mapped to the un-merged assembly, it will still map to the merged assembly just as well, since no unique sequences are removed or altered (with the default settings).

That said, whether or not it is a good idea to generate and merge multiple assemblies is a different question. The approach will generally lead to some redundancy that Dedupe won't remove (because neither sequence fully contains the other, or they don't match perfectly), so you'll end up with a larger-than-expected assembly.

ADD REPLYlink written 5.3 years ago by Brian Bushnell17k

Thanks, Brian, for the clarification. I'm working on a plant transcriptome assembly, so I don't think Dedupe is suitable for me, as I need a tool that generates a final assembly after removing identical contigs. I am still waiting for suggestions and experience from other users.

ADD REPLYlink written 5.3 years ago by seta1.3k

You can run Dedupe in a mode that will only remove identical contigs, just by adding the flag "ac=f" which turns off looking for contained subsequences.

ADD REPLYlink written 5.3 years ago by Brian Bushnell17k