Question

Transcriptome Data Analysis Using CD-HIT

2

12.4 years ago

Kiriya ▴ 100

We have sequenced the transcriptome of a species that does not have a sequenced genome, using 454, and our initial goal is to find a set of ESTs that represent genes. The 454 reads were assembled using Newbler 2.5, and the initial assembly gave ~26,000 isotigs and 18,000 contigs. After talking to several people, I used the CD-HIT program to combine the isotigs, the contigs and the singletons that were not assembled. After combining these sequences, I got ~4,000 isotigs, ~17,000 contigs and ~30,000 singletons that were not assembled. Is this the correct way to do this? I couldn't find any publication that mentions this method.

transcriptome gene • 4.9k views

ADD COMMENT • link updated 7.1 years ago by njtulsani ▴ 60 • written 12.4 years ago by Kiriya ▴ 100

0

Which identity thresholds did you use?

ADD REPLY • link 11.5 years ago by Yannick Wurm ★ 2.5k

0

The algorithms behind CD-HIT are described in the following papers, all published in Bioinformatics.

Clustering of highly homologous sequences to reduce the size of large protein databases. Weizhong Li, Lukasz Jaroszewski & Adam Godzik. Bioinformatics (2001) 17:282-283.
Tolerating some redundancy significantly speeds up clustering of large protein databases. Weizhong Li, Lukasz Jaroszewski & Adam Godzik. Bioinformatics (2002) 18:77-82.
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Weizhong Li & Adam Godzik. Bioinformatics (2006) 22:1658-1659.
CD-HIT: accelerated for clustering the next generation sequencing data. Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu & Weizhong Li. Bioinformatics (2012) 28:3150-3152, doi: 10.1093/bioinformatics/bts565.

Please check these papers on CD-HIT.

ADD REPLY • link 7.1 years ago by njtulsani ▴ 60

score 1 · Answer 1 · 2011-11-20

1

12.4 years ago

Damian Kao 16k

Why do you want to further cluster the reads? Isotigs are grouped into isogroups in Newbler. Think of isogroup as the gene and isotigs as the alternate splice forms. Isotigs are made from contigs in Newbler, so you don't need to cluster the isotigs with the contigs.

If you feel you can get more data out of the unassembled reads, you can try running CD-HIT followed by CAP3 on the unassembled reads together with the isotigs.
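A minimal sketch of how that CD-HIT + CAP3 pass could be scripted. The file names, the 0.95 identity threshold and the word length of 10 are assumptions chosen for illustration (no thresholds are given in this thread). `cd-hit-est` is the nucleotide version of CD-HIT.

```python
import shlex


def concat_fasta(paths, out_path):
    """Concatenate several FASTA files into one, keeping records on separate lines."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    out.write(line if line.endswith("\n") else line + "\n")


def cdhit_cap3_commands(combined_fa, identity=0.95, word_len=10):
    """Build (but do not run) the two commands as argument lists.

    identity and word_len are illustrative assumptions, not values from the thread.
    """
    clustered_fa = combined_fa + ".cdhit"  # hypothetical output name
    return [
        # Collapse near-identical sequences: -c is the identity threshold, -n the word length.
        ["cd-hit-est", "-i", combined_fa, "-o", clustered_fa,
         "-c", str(identity), "-n", str(word_len)],
        # Assemble the reduced set; CAP3 writes <input>.cap.contigs and <input>.cap.singlets.
        ["cap3", clustered_fa],
    ]


if __name__ == "__main__":
    # Hypothetical inputs would first be merged with, for example:
    #   concat_fasta(["isotigs.fna", "unassembled_reads.fna"], "combined.fa")
    for cmd in cdhit_cap3_commands("combined.fa"):
        print(shlex.join(cmd))
```

Each command list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed and the thresholds have been chosen for the data set.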

ADD COMMENT • link 12.4 years ago by Damian Kao 16k

0

Thanks, DK, for your answer. For the annotation, should I keep only the longest isotig from each isogroup?

ADD REPLY • link 12.4 years ago by Kiriya ▴ 100

0

That's trickier. The longest isotig isn't always the most inclusive transcript. I would just report all the possible isoforms.
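As a rough sketch, all isoforms can be listed per isogroup straight from the isotig FASTA headers. This assumes the headers carry `gene=isogroupNNNNN` and `length=N` fields, as in a typical Newbler isotig file. Check your own file, since that layout is an assumption here.

```python
import re
from collections import defaultdict

# Assumed header layout: ">isotig00001  gene=isogroup00001  length=1500  numContigs=3"
HEADER_RE = re.compile(r"^>(\S+).*?\bgene=(\S+).*?\blength=(\d+)")


def isotigs_by_isogroup(fasta_lines):
    """Map each isogroup to its (isotig, length) pairs, longest first."""
    groups = defaultdict(list)
    for line in fasta_lines:
        match = HEADER_RE.match(line)
        if match:
            isotig, isogroup, length = match.group(1), match.group(2), int(match.group(3))
            groups[isogroup].append((isotig, length))
    # Sort for readability only; every isoform is kept, none is discarded.
    return {g: sorted(members, key=lambda m: -m[1]) for g, members in groups.items()}


if __name__ == "__main__":
    example = [
        ">isotig00001  gene=isogroup00001  length=1200  numContigs=2",
        "ACGT",
        ">isotig00002  gene=isogroup00001  length=1500  numContigs=3",
        "ACGT",
        ">isotig00003  gene=isogroup00002  length=800  numContigs=1",
        "ACGT",
    ]
    for isogroup, members in isotigs_by_isogroup(example).items():
        print(isogroup, members)
```

The list is sorted by length only to make it easier to read. Every isotig stays in the report.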

ADD REPLY • link 12.4 years ago by Damian Kao 16k

score 1 · Answer 2 · 2011-11-23

454 recommended the same method to me. Some isotigs from the same isogroup are very similar, differing by just a few bases because of heterozygosity in the sample(s) sequenced. CD-HIT lets you cluster these transcripts (isotigs). After clustering, you should check again how many transcripts (isotigs) remain for each isogroup. In principle, isotigs should cluster only with isotigs from the same isogroup. Isotigs from the same isogroup that remain after clustering could very well represent real isoforms.
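The check described above could be sketched as follows. It assumes the standard CD-HIT `.clstr` output, with one `>Cluster N` line per cluster and member lines such as `0	1500nt, >isotig00001... *`, where `*` marks the representative. It also assumes a hypothetical `isogroup_of` dictionary mapping isotig names to isogroups, built separately from the Newbler output.

```python
import re
from collections import Counter

# Member lines look like: "1\t1480nt, >isotig00002... at +/99.10%"
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines):
    """Return clusters as dicts with the representative ('rep') and all 'members'."""
    clusters = []
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            clusters.append({"rep": None, "members": []})
        elif clusters:
            match = MEMBER_RE.search(line)
            if match:
                clusters[-1]["members"].append(match.group(1))
                if line.endswith("*"):  # '*' marks the cluster representative
                    clusters[-1]["rep"] = match.group(1)
    return clusters


def summarize(clusters, isogroup_of):
    """Count surviving isotigs per isogroup and list clusters that mix isogroups."""
    remaining = Counter()
    mixed = []
    for cluster in clusters:
        if cluster["rep"] in isogroup_of:
            remaining[isogroup_of[cluster["rep"]]] += 1
        groups = {isogroup_of[m] for m in cluster["members"] if m in isogroup_of}
        if len(groups) > 1:
            mixed.append(cluster["members"])
    return remaining, mixed
```

A non-empty `mixed` list flags clusters that span more than one isogroup, which are worth inspecting by hand. Isogroups whose count in `remaining` is still greater than one are candidates for real isoforms.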

score 0 · Answer 3 · 2017-04-03

Applications can be found in the papers that cite CD-HIT:

Li et al (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.
Li et al (2001) Clustering of highly homologous sequences to reduce the size of large protein databases.
Li et al (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases.
Huang et al (2010) CD-HIT Suite: a web server for clustering and comparing biological sequences.
Niu et al (2009) Artificial and natural duplicates in pyrosequencing reads of metagenomic data.