Question: What may be causing super high duplication results in BUSCO?
2.5 years ago by guillermo.ponz.segrelles:

Hi everybody,

I'm working on some transcriptomes from non-model organisms, assembled from Illumina reads, and I'm facing a problem I haven't encountered before. To make it short, my data come from sequencing 3 samples on an Illumina sequencer, each one consisting of a pool of 5 entire specimens. I checked the quality in FastQC and trimmed with Trimmomatic accordingly. After that, I concatenated the resulting files to make a single assembly of the ~100M reads. Then I did a standard Trinity assembly (without in-silico normalization). Here starts the strange part:

The assembly resulted in 526860 transcripts (isoform-level) with an N50 of 858 and a median contig length of 377. In addition (and this is what really worries me), I ran BUSCO to assess completeness and got the following result: C:98.6%[S:18.0%,D:80.6%],F:1.3%,M:0.1%,n:978.
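For readers unfamiliar with the BUSCO summary notation, here is a minimal Python sketch that parses the line quoted above and checks how the categories relate: C (complete) is the sum of S (single-copy) and D (duplicated), while C, F (fragmented) and M (missing) add up to 100%. The function name `parse_busco_summary` is made up for this illustration and is not part of BUSCO.

```python
import re

def parse_busco_summary(summary: str) -> dict:
    """Parse a one-line BUSCO summary into a dict of category -> value.

    Hypothetical helper for illustration only; not part of BUSCO itself.
    """
    values = {}
    for key, number in re.findall(r"([CSDFMn]):([\d.]+)", summary):
        # n is the number of BUSCO groups searched; the rest are percentages
        values[key] = int(number) if key == "n" else float(number)
    return values

busco = parse_busco_summary("C:98.6%[S:18.0%,D:80.6%],F:1.3%,M:0.1%,n:978")

# Complete = single-copy + duplicated (allowing for rounding in the report)
assert abs(busco["C"] - (busco["S"] + busco["D"])) < 0.05
# Complete + fragmented + missing should cover all BUSCO groups
assert abs(busco["C"] + busco["F"] + busco["M"] - 100.0) < 0.05

# Fraction of the complete BUSCOs that were found more than once
duplicated_share = busco["D"] / busco["C"]
print(f"{duplicated_share:.1%} of complete BUSCOs are duplicated")
```

So the assembly is nearly complete; the concern is only that most complete BUSCOs are matched by more than one transcript.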

This duplication level is ridiculously high, but I don't really know what is causing it. I've checked the BUSCO documentation and both Biostars and SEQanswers, but I haven't found duplication levels like this in a transcriptome. Have you had any similar experience? Do you have any suggestions to make these numbers go down?

I'm stuck with this and would really appreciate any help.


2.5 years ago, h.mon (29k) wrote:

Check the Trinity FAQ, you will see your "strange" results are not strange at all.

In addition, the level of duplication you are seeing in the BUSCO results is probably due to Trinity assembling several isoforms of the same gene. There are some ways of reducing this redundancy:

1) you may select longest isoform or most expressed isoform (check Trinity wiki for how)

2) cluster with CD-HIT (usually done at 95% similarity)

3) cluster with iAssembler or TGICL

4) try the new SuperTranscripts method
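As an illustration of option 1, here is a minimal Python sketch that keeps the longest isoform per Trinity gene. It assumes the usual Trinity header convention, in which the transcript ID ends in `_i<number>` and everything before that suffix identifies the gene; the function names and the toy sequences are invented for this example. Trinity's own tooling, described in its wiki, is the better choice for real work.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def trinity_gene_id(header: str) -> str:
    """Strip the trailing isoform suffix (_i1, _i2, ...) from a transcript ID.

    Assumes IDs shaped like TRINITY_DN10_c0_g1_i2.
    """
    transcript_id = header.split()[0]
    return re.sub(r"_i\d+$", "", transcript_id)

def longest_isoform_per_gene(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Map each gene ID to the (header, sequence) of its longest isoform."""
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in records:
        gene = trinity_gene_id(header)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best

# Toy input; for a real assembly, pass an open file handle instead.
example = """>TRINITY_DN10_c0_g1_i1 len=8
ACGTACGT
>TRINITY_DN10_c0_g1_i2 len=12
ACGTACGTACGT
>TRINITY_DN11_c0_g1_i1 len=4
ACGT
""".splitlines()

kept = longest_isoform_per_gene(read_fasta(example))
for gene, (header, seq) in kept.items():
    print(gene, header.split()[0], len(seq))
```

Running BUSCO on the reduced set instead of on all isoforms should then show how much of the D value comes from isoforms alone.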



Thank you very much for your suggestions. I don't know how I could have missed this in the Trinity FAQ...

Now I'm trying 2 approaches: 1. Follow the Trinity FAQ advice and keep all transcripts for downstream analysis. 2. Use CD-HIT (which has reduced the BUSCO D value to 46% without any decrease in the C value) and then select the most expressed isoform per Trinity 'gene', as you suggested.

I will do both analyses for comparison and post an update here, in case someone finds it useful.
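The "most expressed isoform per gene" step in approach 2 could be sketched in Python as below. This assumes a table of per-transcript expression values (for example TPM from whichever quantifier is used) is already loaded into a dict, and that transcript IDs follow the Trinity `_i<number>` suffix convention; the function name and the expression values are invented for illustration.

```python
import re
from typing import Dict

def most_expressed_isoform_per_gene(tpm_by_transcript: Dict[str, float]) -> Dict[str, str]:
    """Map each Trinity gene ID to its isoform with the highest expression value."""
    best: Dict[str, str] = {}
    for transcript, tpm in tpm_by_transcript.items():
        # Gene ID = transcript ID without the trailing _i<number> suffix
        gene = re.sub(r"_i\d+$", "", transcript)
        if gene not in best or tpm > tpm_by_transcript[best[gene]]:
            best[gene] = transcript
    return best

# Made-up expression values, not taken from the assembly discussed here
tpm = {
    "TRINITY_DN10_c0_g1_i1": 3.2,
    "TRINITY_DN10_c0_g1_i2": 41.0,
    "TRINITY_DN11_c0_g1_i1": 0.7,
}

selected = most_expressed_isoform_per_gene(tpm)
print(selected)
```

Ties are resolved in favour of the isoform seen first, which is an arbitrary choice made for this sketch.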

Reply written 2.4 years ago by guillermo.ponz.segrelles
2.5 years ago, gilbert.bionet (130) wrote:

EvidentialGene does what you ask about: it turns a transcript over-assembly into a classified gene set, with primary and alternate transcripts, and removes redundant copies. You can then measure only the primaries with BUSCO, which avoids the duplicate problem caused by counting alternate isoforms. As you assembled from several individuals, you may well have heterozygous allelic transcripts, another form of redundancy. Your Trinity-only assembly can be improved by adding the Velvet/Oases, SOAP-trans, and/or idba-trans assemblers, which do high k-mer assemblies that produce more complete genes and fewer fragments, especially for complex genes. EvidentialGene will then reduce that over-assembly from the many candidates to the most accurate set of genes.

See also Biostars: "EvidentialGene reduces redundancy of de novo transcriptome assembly?"
