Question

Is there any alternative to CD-HIT for removing redundancy from an assembled Trinity output file?

1

5.8 years ago

rahmati.razieh83 ▴ 30

Hi everyone

I am having trouble reducing the redundancy in a Trinity output file. My Trinity assembly is a FASTA file of 302000 contigs with a great deal of redundancy. I ran CD-HIT to remove the redundant sequences and obtain unigenes, but that only reduced the set to 240000 contigs, which still look highly redundant. CD-HIT was not effective at producing unigenes. Could you please advise me on how to remove the redundancy and obtain unigenes?

Thanks

Assembly • 5.5k views

ADD COMMENT • link updated 3.5 years ago by bcontreras ▴ 10 • written 5.8 years ago by rahmati.razieh83 ▴ 30

0

Did you tweak the identity threshold using -c in cd-hit?
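For reference, a minimal sketch of what changing the threshold could look like with cd-hit-est, the nucleotide version of CD-HIT. The file names, the 0.95 threshold, and the thread and memory settings are assumptions chosen for illustration, not values taken from this thread.

```shell
# Assumed file names and settings; adjust them to your own data.
in_fasta="Trinity.fasta"
identity=0.95                                  # -c: sequence identity threshold
out_fasta="Trinity.cdhit_c${identity}.fasta"

if command -v cd-hit-est >/dev/null 2>&1 && [ -s "$in_fasta" ]; then
    # -n 10 is a word size suited to thresholds of 0.95 and above;
    # -T sets the number of threads and -M the memory limit in MB.
    cd-hit-est -i "$in_fasta" -o "$out_fasta" -c "$identity" -n 10 -T 4 -M 8000
    # Count the representative sequences that remain.
    grep -c '^>' "$out_fasta"
else
    echo "cd-hit-est or $in_fasta not found; nothing was run" >&2
fi
```

Lowering -c merges more sequences per cluster, so it is worth comparing the contig counts at two or three thresholds before settling on one.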

ADD REPLY • link 5.8 years ago by Sej Modha 5.3k

0

When you say

240000 contigs showing lots of redundancies again

How do you verify that?
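One rough way to check is to compare the number of contigs with the number of Trinity gene-level identifiers. The sketch below assumes the usual Trinity header convention, in which the first word of each header ends in an isoform suffix such as `_i1`; the function name is made up for this example.

```shell
# Hypothetical helper: count contigs and distinct gene-level IDs in a FASTA file.
# Assumes Trinity-style headers such as ">TRINITY_DN1_c0_g1_i2 len=280 ...".
summarise_trinity_fasta() {
    awk '/^>/ {
            n++                          # one header line per contig
            id = substr($1, 2)           # first word of the header, without ">"
            sub(/_i[0-9]+$/, "", id)     # drop the isoform suffix to get the gene ID
            if (!(id in seen)) { seen[id] = 1; g++ }
         }
         END { printf "contigs=%d genes=%d\n", n, g }' "$1"
}
```

Running `summarise_trinity_fasta Trinity.fasta` before and after clustering shows how many of the remaining contigs are isoforms of the same Trinity gene rather than separate genes.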

Try using TGICL

ADD REPLY • link 5.8 years ago by lakhujanivijay 5.8k

score 1 · Answer 1 · 2018-07-07

Trinity has a somewhat new script to construct "SuperTranscripts" based on the gene-to-isoform relationships and the sequence graph structure leveraged by Trinity during assembly. I think this will result in a better representation of unigenes than using CD-HIT.

$TRINITY_HOME/Analysis/SuperTranscripts/Trinity_gene_splice_modeler.py \
   --trinity_fasta Trinity.fasta

score 1 · Answer 2 · 2018-07-09

1

5.8 years ago

Jake Warner ▴ 830

Getting 'unigenes' from Trinity assemblies is tricky business. I've found that Corset performs better than CD-HIT. Another idea is to BLAST all the transcripts against each other and group them by reciprocal best BLAST hit.
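A rough sketch of the reciprocal-best-hit idea, assuming BLAST+ is installed and the all-vs-all search is written in tabular format (`-outfmt 6`, bitscore in column 12). The file names, the e-value cutoff, and the function name are placeholders for illustration. The two BLAST commands are left as comments so that the pairing step can be tried on any tabular file.

```shell
# Assumed upstream steps (placeholder file names):
#   makeblastdb -in Trinity.fasta -dbtype nucl -out trinity_db
#   blastn -query Trinity.fasta -db trinity_db -outfmt 6 -evalue 1e-20 -out all_vs_all.tsv

# Hypothetical helper: print each pair of sequences that are each other's best non-self hit.
reciprocal_best_hits() {
    awk -F'\t' '
        # Keep the highest-bitscore hit per query, ignoring self hits.
        $1 != $2 && (!($1 in best) || $12 + 0 > score[$1]) {
            best[$1] = $2
            score[$1] = $12 + 0
        }
        END {
            for (q in best) {
                s = best[q]
                # Report a pair once, only if the best hit is mutual.
                if ((s in best) && best[s] == q && (q "") < (s "")) print q "\t" s
            }
        }' "$1" | sort
}
```

Calling `reciprocal_best_hits all_vs_all.tsv` lists candidate pairs only. Turning the pairs into groups, and choosing identity and coverage cutoffs, is left open here.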

ADD COMMENT • link 5.8 years ago by Jake Warner ▴ 830

1

LACE and Corset are tools from the same group. Initially I thought LACE would be the preferred tool, as it was developed more recently, but I was wrong: according to one of the authors of both tools, they should be equivalent for the purpose of gene-level differential expression analysis. As Trinity's Trinity_gene_splice_modeler.py script is based on the same algorithm as LACE, it should also be equivalent to Corset.

ADD REPLY • link 5.8 years ago by h.mon 35k

score 1 · Answer 3 · 2020-10-20

We have used our own tool, https://github.com/eead-csic-compbio/get_homologues, successfully. In fact, we benchmarked it against CD-HIT in https://www.frontiersin.org/articles/10.3389/fpls.2017.00184/full

I believe the main problem is that CD-HIT does not properly handle isoforms with different exons or retained introns, whereas GET_HOMOLOGUES-EST can cluster them correctly in most cases.