Question: Does EvidentialGene reduce redundancy in de novo transcriptome assemblies?
2.4 years ago, crimsontabaq (Russia, Kazan) wrote:


Since our Trinity output assembly ('original') produced a lot of duplicated BLASTx matches, we decided to try reducing redundancy with the tr2aacds script from the EvidentialGene project. tr2aacds filters and merges contigs according to their coding potential and percent identity, which sounds more legitimate than the blast2cap3 approach or simple duplicate removal.

To compare the original and filtered assemblies, we ran some checks with BUSCO and BLASTx. The result: yes, the number of duplicates decreased (BUSCO), but the number of missing and fragmented contigs increased. These non-redundant assemblies still give some duplicated BLAST results, albeit far fewer. We're afraid of losing biologically meaningful data, but redundancy also causes problems in downstream analysis.
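
One way to put a number on the "duplicated BLAST results" before and after filtering is to count reference proteins that are the best hit of more than one contig. The sketch below is only an illustration: it assumes standard 12-column tabular BLAST output (bitscore in the last column), and the function names and example IDs are made up for this post.

```python
from collections import defaultdict


def best_hits(blast_lines):
    """Return {query_id: (subject_id, bitscore)}, keeping the top-bitscore hit per query.

    Assumes 12-column tabular BLAST output; '#' comment lines are skipped.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        query, subject, bits = cols[0], cols[1], float(cols[11])
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return best


def duplicated_references(blast_lines):
    """Return {subject_id: [query_ids]} for references that are the best hit of >1 contig."""
    by_subject = defaultdict(list)
    for query, (subject, _bits) in best_hits(blast_lines).items():
        by_subject[subject].append(query)
    return {s: sorted(q) for s, q in by_subject.items() if len(q) > 1}


# Tiny made-up example: two Trinity isoforms share the same best reference hit.
example = [
    "c1_g1_i1\tP001\t95.0\t200\t10\t0\t1\t600\t1\t200\t1e-100\t350",
    "c1_g1_i2\tP001\t93.0\t190\t13\t0\t1\t570\t1\t190\t1e-90\t320",
    "c2_g1_i1\tP002\t90.0\t150\t15\t0\t1\t450\t1\t150\t1e-70\t250",
]
print(duplicated_references(example))
```

Running the same count on the original and the tr2aacds-filtered assembly gives a rough, comparable redundancy figure; it says nothing about whether the extra contigs are real isoforms or artifacts.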

Does anybody use tr2aacds to reduce redundancy in de novo assemblies?

modified 2.4 years ago by gilbert.bionet130 • written 2.4 years ago by crimsontabaq40

Hello crimsontabaq,

I have some questions about how you use that tool. I looked for a way to send you a private message, but I think that is not possible on this forum, so I have to ask here (sorry). How did you apply the EG approach? Did you modify any of the configuration files?

Thank you for your time.

written 2.4 years ago by pablo6199170

Hello Pablo! We just used one of the Evigene scripts supplied with the project. We looked through the configs and didn't find anything relevant to our task, so we just passed the needed options to the script itself at run time.

written 2.4 years ago by crimsontabaq40

Thank you for the clarification; I did the same. In our case we really need to reduce the redundancy of our transcriptome because we obtained more than 1,000,000 transcripts, and CD-HIT-EST didn't help (it reduced the dataset, but we still had 900k transcripts). For that reason we haven't checked for the effects you found. Maybe we have the same issue, maybe not; I'll try to check, but since we used the same assembler I expect the same "problems".

written 2.4 years ago by pablo6199170

Wow, one million. Just a wild guess: have you changed the minimum contig size in Trinity? We set this value to the size of the smallest protein of a related species multiplied by 3 (amino acids to nucleotides). Maybe not the most correct approach, but the resulting assembly is quite OK apart from the issues I described earlier.
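
For anyone wanting to reproduce that arithmetic, here is a minimal sketch. It assumes you have a protein FASTA for a related species; the function names and toy sequences are invented for illustration, and the result would be passed to Trinity's `--min_contig_length` option.

```python
def min_protein_length(fasta_lines):
    """Return the length (in amino acids) of the shortest protein in a FASTA stream."""
    lengths = []
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            # Ignore a trailing '*' stop symbol if the file contains one.
            current += len(line.rstrip("*"))
    if current is not None:
        lengths.append(current)
    if not lengths:
        raise ValueError("no FASTA records found")
    return min(lengths)


def min_contig_length(fasta_lines, codon_size=3):
    """Shortest protein length converted to nucleotides (3 nt per amino acid)."""
    return min_protein_length(fasta_lines) * codon_size


# Toy example: proteins of 70 and 120 amino acids.
toy_fasta = [
    ">protA",
    "M" + "A" * 69,
    ">protB",
    "M" + "K" * 119 + "*",
]
print(min_contig_length(toy_fasta))  # 70 aa * 3 = 210 nt
```

Note that this ignores UTRs and partial transcripts, so it is a rough lower bound rather than a recommended setting.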

modified 2.4 years ago • written 2.4 years ago by crimsontabaq40

No, we kept the default config; I think it is something like a 200 nt minimum size. In my opinion your approach is fine, but my supervisor is paranoid about losing biologically relevant information, even though we end up with an assembly containing a huge amount of highly redundant data (and surely a lot of artifacts as well).

written 2.4 years ago by pablo6199170
2.4 years ago, gilbert.bionet wrote:

This "..increased number of missing and fragmented contigs.." could be due to various things, but two I know or surmise from experience are part of your results:

  1. Trinity, and other assemblers, produce joined genes (fusions, chimeras) that can be scored as existing/full genes by BLASTx, or whatever BUSCO software you use, because of the way those measures work. However, a transcript made up of two or more gene loci isn't what Evigene considers accurate. You can instead make protein translations of your transcripts (as Evigene does), then measure with BLASTp against reference proteins to count valid proteins. Or else check your BLASTx results for cases of joined genes (before Evigene reduction).

  1b. Using several gene assemblers, such as Velvet/Oases, idba_tran, Soap_Trans, with multi-kmer options, will produce a more complete gene set from your RNA than using Trinity alone (those others resolve gene joins and fragments better for loci where Trinity fails). That is what Evigene was designed for: reducing many gene assemblies to the best coding gene subset.

  2. Some settings for tr2aacds may be changed to return more of the smaller proteins, if those are what are now missing from your reduced transcript set. You can check which genes are missing, and if they are small ones (e.g. 30 to 60 aminos, or smaller), resetting some of the tr2aacds minimum protein size settings will recover those. As an alternative, you can add reference blast scores to tr2aacds for each input transcript to retain those with good reference alignments. As it sounds like you have blast scores already for your input transcripts, make an aablast table of those (trid <tab> refid <tab> blast_bitscore <newline>), then run tr2aacds with that option -ablastab [I think it also will read standard blastp/blastx -outfmt 7 tables ].
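
The three-column table described in point 2 can be built from existing tabular BLAST output with a few lines of Python. This is a sketch under assumptions, not part of Evigene: it assumes 12-column tabular BLAST output with the transcript ID as query, and keeping only the best-scoring reference per transcript is a choice made here for illustration. The function name and example IDs are hypothetical.

```python
def to_aablast_table(blast_lines):
    """Build 'trid<TAB>refid<TAB>bitscore' lines from tabular BLAST output.

    Keeps the highest-bitscore reference per transcript (an assumption of this
    sketch). '#' comment lines, as written by tabular-with-comments output,
    are skipped.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        trid, refid, bits_text = cols[0], cols[1], cols[11].strip()
        bits = float(bits_text)
        if trid not in best or bits > best[trid][1]:
            # Keep the original bitscore text so nothing is lost in formatting.
            best[trid] = (refid, bits, bits_text)
    return "".join(
        f"{trid}\t{refid}\t{bits_text}\n"
        for trid, (refid, _bits, bits_text) in best.items()
    )


# Made-up example input with one comment line and two hits for transcript t1.
example = [
    "# BLASTX example",
    "t1\tP009\t60.0\t100\t40\t0\t1\t300\t1\t100\t1e-20\t90.5",
    "t1\tP001\t95.0\t200\t10\t0\t1\t600\t1\t200\t1e-100\t350",
    "t2\tP002\t90.0\t150\t15\t0\t1\t450\t1\t150\t1e-70\t250",
]
print(to_aablast_table(example), end="")
```

The returned string can be written to a file and given to tr2aacds via the -ablastab option mentioned above; check the tr2aacds usage text for the exact form that option expects.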

  • Don Gilbert
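
The suggestion in point 1 to check BLASTx results for joined genes can be roughed out as follows. This is a heuristic sketch, not how Evigene detects chimeras: it flags a transcript when two different reference proteins align to essentially non-overlapping parts of it. It assumes 12-column tabular BLAST output; the thresholds (min_bits, max_overlap), function names, and example IDs are arbitrary choices for illustration. Hits flagged this way are candidates for manual inspection only, since multi-domain proteins or fragmented references can produce the same pattern.

```python
from collections import defaultdict


def parse_hits(blast_lines, min_bits=50.0):
    """Yield (query, subject, q_lo, q_hi) for hits at or above min_bits."""
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        qstart, qend = int(cols[6]), int(cols[7])
        if float(cols[11]) >= min_bits:
            # BLASTx reports qstart > qend for reverse-strand hits; normalize.
            yield cols[0], cols[1], min(qstart, qend), max(qstart, qend)


def joined_gene_candidates(blast_lines, min_bits=50.0, max_overlap=10):
    """Return {query: [(ref_a, ref_b), ...]} for pairs of different references
    whose aligned query intervals overlap by at most max_overlap nucleotides."""
    hits_by_query = defaultdict(list)
    for query, subject, lo, hi in parse_hits(blast_lines, min_bits):
        hits_by_query[query].append((subject, lo, hi))

    candidates = {}
    for query, hits in hits_by_query.items():
        pairs = set()
        for i, (s1, lo1, hi1) in enumerate(hits):
            for s2, lo2, hi2 in hits[i + 1:]:
                if s1 == s2:
                    continue
                overlap = min(hi1, hi2) - max(lo1, lo2) + 1
                if overlap <= max_overlap:
                    pairs.add(tuple(sorted((s1, s2))))
        if pairs:
            candidates[query] = sorted(pairs)
    return candidates


# Made-up example: 'chim1' matches P001 and P002 on separate halves.
example = [
    "chim1\tP001\t95.0\t200\t10\t0\t1\t600\t1\t200\t1e-100\t350",
    "chim1\tP002\t92.0\t200\t16\t0\t700\t1300\t1\t200\t1e-90\t300",
    "ok1\tP003\t96.0\t300\t12\t0\t1\t900\t1\t300\t1e-150\t500",
    "ok1\tP004\t90.0\t277\t27\t0\t50\t880\t5\t281\t1e-120\t450",
]
print(joined_gene_candidates(example))
```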
written 2.4 years ago by gilbert.bionet130

Hello, and thanks for the expert comment!

I take your points about the missing proteins, but could you comment on the excessive annotation we're trying to reduce? Is tr2aacds really suited to this job? We set the options for the Evigene script according to our task, but we didn't know that it's possible to feed blast results to it. Thanks!

modified 2.4 years ago • written 2.4 years ago by crimsontabaq40

Just to point it out, (I think) he is the author of the tool ;) (congrats Mr. Gilbert).

written 2.4 years ago by pablo6199170