Question: Transcriptome Data Analysis Using CD-HIT
6.3 years ago by
Kiriya100
Kiriya100 wrote:

We have sequenced the transcriptome of a species that does not have a sequenced genome, using 454, and our initial goal is to find a set of ESTs that represent genes. The 454 reads were assembled using Newbler 2.5, and the initial assembly gave ~26,000 isotigs and 18,000 contigs. After talking to several people, I used the CD-HIT program to combine the isotigs, the contigs and the singletons that were not assembled. After combining these sequences, I got ~4,000 isotigs, ~17,000 contigs and ~30,000 singletons that were not assembled. Is this the correct way to do this? I couldn't find any publication that mentions this method.

gene transcriptome • 2.5k views
modified 10 months ago by njtulsani • written 6.3 years ago by Kiriya

Which identity thresholds did you use?

written 5.3 years ago by Yannick Wurm

The algorithms behind CD-HIT are described in four papers published in Bioinformatics:

  1. Clustering of highly homologous sequences to reduce the size of large protein databases. Weizhong Li, Lukasz Jaroszewski & Adam Godzik. Bioinformatics (2001) 17:282-283.
  2. Tolerating some redundancy significantly speeds up clustering of large protein databases. Weizhong Li, Lukasz Jaroszewski & Adam Godzik. Bioinformatics (2002) 18:77-82.
  3. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Weizhong Li & Adam Godzik. Bioinformatics (2006) 22:1658-1659.
  4. CD-HIT: accelerated for clustering the next generation sequencing data. Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu & Weizhong Li. Bioinformatics (2012) 28:3150-3152, doi: 10.1093/bioinformatics/bts565.

Please see these papers for details on CD-HIT.

written 10 months ago by njtulsani
6.3 years ago by
Damian Kao14k
USA
Damian Kao14k wrote:

Why do you want to further cluster the reads? Newbler already groups isotigs into isogroups. Think of an isogroup as a gene and its isotigs as the alternative splice forms of that gene. Newbler builds isotigs from contigs, so you don't need to cluster the isotigs with the contigs.
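To make the isogroup/isotig relationship concrete, here is a minimal sketch that groups isotig names by isogroup from FASTA headers. It assumes the headers carry a `gene=isogroupNNNNN` field, as in the made-up example lines below; check your own Newbler output before relying on that layout. The function name `isotigs_by_isogroup` is just illustrative.

```python
import re
from collections import defaultdict

# Assumed header layout (not confirmed in this thread):
#   >isotig00001  gene=isogroup00001  length=1520  numContigs=3
HEADER_RE = re.compile(r"^>(\S+).*?\bgene=(\S+)")


def isotigs_by_isogroup(fasta_lines):
    """Map each isogroup name to the list of isotig names found in the headers."""
    groups = defaultdict(list)
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        match = HEADER_RE.match(line)
        if match:
            isotig, isogroup = match.groups()
            groups[isogroup].append(isotig)
    return dict(groups)


# Made-up records, for illustration only.
example = [
    ">isotig00001  gene=isogroup00001  length=1520  numContigs=3",
    "ACGTACGT",
    ">isotig00002  gene=isogroup00001  length=1310  numContigs=2",
    "ACGTACGA",
    ">isotig00003  gene=isogroup00002  length=800  numContigs=1",
    "TTGACCAT",
]
groups = isotigs_by_isogroup(example)
```

Counting the entries per isogroup then tells you how many candidate splice forms each putative gene has.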

If you feel you can get more data out of the unassembled reads, you can try running CD-HIT followed by CAP3 on the unassembled reads together with the isotigs.
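One possible reading of that suggestion, as a rough sketch: concatenate the isotigs and the unassembled reads into one FASTA file, collapse near-identical sequences with `cd-hit-est`, then assemble the result with CAP3. The file names and the 0.95 identity threshold below are placeholders, not values anyone in this thread recommended. The code only builds the command lines; you could pass them to `subprocess.run` yourself.

```python
def cd_hit_est_command(fasta_in, fasta_out, identity=0.95):
    """Build a cd-hit-est command line for the given identity threshold.

    The word size (-n) follows the ranges given in the CD-HIT user guide
    for cd-hit-est; only thresholds of 0.90 and above are handled here.
    """
    if identity >= 0.95:
        word_size = 10
    elif identity >= 0.90:
        word_size = 8
    else:
        raise ValueError("this sketch only covers identity >= 0.90")
    return [
        "cd-hit-est",
        "-i", fasta_in,       # input FASTA
        "-o", fasta_out,      # representative sequences
        "-c", str(identity),  # sequence identity threshold
        "-n", str(word_size),
    ]


def cap3_command(fasta_in):
    """Build a CAP3 command line; CAP3 writes its contigs and singlets
    next to the input file (as <input>.cap.contigs and <input>.cap.singlets)."""
    return ["cap3", fasta_in]


# Hypothetical file names: combined.fasta holds isotigs plus unassembled reads.
step1 = cd_hit_est_command("combined.fasta", "combined.nr.fasta")
step2 = cap3_command("combined.nr.fasta")
```

Whether 0.95 is a sensible threshold depends on the heterozygosity of your sample, so treat it as a starting point to experiment with.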

written 6.3 years ago by Damian Kao

Thanks DK for your answer. For the annotation, should I keep the longest Isotig from each Isogroup?

written 6.3 years ago by Kiriya

That's trickier. The longest isotig isn't always the most inclusive transcript. I would just report all the possible isoforms.

written 6.3 years ago by Damian Kao
6.3 years ago by
lexnederbragt1.2k
Oslo, Norway
lexnederbragt1.2k wrote:

454 recommended the same method to me. Some isotigs from the same isogroup are very similar, differing by just a few bases because the sequenced sample(s) are heterozygous. CD-HIT lets you cluster these near-identical transcripts (isotigs). After clustering, you should check again how many transcripts (isotigs) remain for each isogroup. In principle, isotigs should only cluster with isotigs from the same isogroup. Isotigs from the same isogroup that remain separate after clustering could very well represent real isoforms.
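The check described above can be sketched roughly as follows: read the `.clstr` file that CD-HIT writes alongside its output, and flag any cluster whose members come from more than one isogroup. The isotig-to-isogroup mapping is supplied as a plain dict here (you would build it from your own Newbler output), and the cluster lines are made-up examples in the usual `.clstr` layout.

```python
import re

# A member line looks like: "1\t1310nt, >isotig00002... at +/99.24%"
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines):
    """Return a list of clusters, each a list of member sequence names."""
    clusters = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append([])  # start a new cluster
        else:
            match = MEMBER_RE.search(line)
            if match and clusters:
                clusters[-1].append(match.group(1))
    return clusters


def mixed_clusters(clusters, isogroup_of):
    """Return the clusters whose members span more than one isogroup."""
    mixed = []
    for members in clusters:
        groups = {isogroup_of[m] for m in members if m in isogroup_of}
        if len(groups) > 1:
            mixed.append(members)
    return mixed


# Made-up data, for illustration only.
clstr_lines = [
    ">Cluster 0",
    "0\t1520nt, >isotig00001... *",
    "1\t1310nt, >isotig00002... at +/99.24%",
    ">Cluster 1",
    "0\t800nt, >isotig00003... *",
    "1\t640nt, >isotig00004... at +/98.10%",
]
isogroup_of = {
    "isotig00001": "isogroup00001",
    "isotig00002": "isogroup00001",
    "isotig00003": "isogroup00002",
    "isotig00004": "isogroup00003",
}
clusters = parse_clstr(clstr_lines)
suspicious = mixed_clusters(clusters, isogroup_of)
```

Any cluster that ends up in `suspicious` joins isotigs from different isogroups and is worth inspecting by hand.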

written 6.3 years ago by lexnederbragt
10 months ago by
njtulsani20
njtulsani20 wrote:

Applications can be found in the papers that cite CD-HIT (external links to Google Scholar):

  1. Li et al (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.
  2. Li et al (2001) Clustering of highly homologous sequences to reduce the size of large protein databases.
  3. Li et al (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases.
  4. Huang et al (2010) CD-HIT Suite: a web server for clustering and comparing biological sequences.
  5. Niu et al (2009) Artificial and natural duplicates in pyrosequencing reads of metagenomic data.

written 10 months ago by njtulsani