We have sequenced the transcriptome of a species that does not have a sequenced genome, using 454, and our initial goal is to find a set of ESTs that represent genes. The 454 reads were assembled using Newbler 2.5, and the initial assembly gave ~26,000 isotigs and 18,000 contigs. After talking to several people, I used the CD-HIT program to combine the isotigs, contigs and singletons that were not assembled. After combining these sequences, I got ~4,000 isotigs, ~17,000 contigs and ~30,000 singletons that were not assembled. Is this the correct way to do this? I couldn't find any publication that mentions this method.
Why do you want to further cluster the reads? Isotigs are grouped into isogroups in Newbler. Think of an isogroup as the gene and its isotigs as the alternative splice forms. Isotigs are built from contigs in Newbler, so you don't need to cluster the isotigs with the contigs.
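To see that grouping in your own assembly, here is a minimal Python sketch that counts isotigs per isogroup. It assumes the isotig FASTA headers carry a `gene=isogroupNNNNN` field (something like `>isotig00001  gene=isogroup00001  length=...`); please check this against your own Newbler output, since the header layout is an assumption here and the function name is made up for illustration.

```python
import re
from collections import defaultdict

# Assumed header layout (verify against your own file):
#   >isotig00001  gene=isogroup00001  length=900  numContigs=2
HEADER_RE = re.compile(r"^>(\S+).*?\bgene=(\S+)")

def isotigs_by_isogroup(fasta_lines):
    """Map isogroup ID -> list of isotig IDs, using FASTA header lines only."""
    groups = defaultdict(list)
    for line in fasta_lines:
        match = HEADER_RE.match(line)
        if match:  # sequence lines and headers without gene= are skipped
            isotig_id, isogroup_id = match.groups()
            groups[isogroup_id].append(isotig_id)
    return dict(groups)

if __name__ == "__main__":
    # Tiny made-up example; in practice pass an open file handle instead.
    example = [
        ">isotig00001  gene=isogroup00001  length=900  numContigs=2",
        "ACGT",
        ">isotig00002  gene=isogroup00001  length=850  numContigs=2",
        "ACGT",
        ">isotig00003  gene=isogroup00002  length=400  numContigs=1",
        "ACGT",
    ]
    for isogroup, isotigs in isotigs_by_isogroup(example).items():
        print(isogroup, len(isotigs))
```

An isogroup with a single isotig is a gene with one reconstructed transcript; isogroups with several isotigs are the candidates for splice variants.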
If you feel you can get more data out of the unassembled reads, you can try running CD-HIT followed by CAP3 on the unassembled reads together with the isotigs.
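As a rough sketch of what that two-step run might look like, the Python snippet below only builds the two command lines; it does not run anything. It assumes the isotigs and the unassembled reads have already been concatenated into one FASTA file (the file names are hypothetical), that `cd-hit-est` and `cap3` are on your PATH, and it uses a 95% identity threshold with word size 10 purely as illustrative values, not as a recommendation from this thread.

```python
import shlex

def build_commands(combined_fasta, identity=0.95, word_size=10):
    """Return (cd-hit-est argv, cap3 argv) for a combined isotig + read FASTA.

    identity and word_size are illustrative defaults, not tuned values.
    """
    nr_fasta = combined_fasta + ".nr"  # hypothetical name for the CD-HIT output
    cdhit_cmd = [
        "cd-hit-est",
        "-i", combined_fasta,   # input FASTA
        "-o", nr_fasta,         # non-redundant output (a .clstr file is written alongside)
        "-c", str(identity),    # sequence identity threshold
        "-n", str(word_size),   # word size; should suit the chosen threshold
    ]
    cap3_cmd = ["cap3", nr_fasta]  # CAP3 takes the read file as its first argument
    return cdhit_cmd, cap3_cmd

if __name__ == "__main__":
    for cmd in build_commands("isotigs_plus_unassembled.fna"):
        print(shlex.join(cmd))
```

To actually run the steps, each list can be passed to `subprocess.run(cmd, check=True)` in order.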
454 recommended the same method to me. Some isotigs from the same isogroup are very similar, differing by just a few bases because of heterozygosity in the sample(s) sequenced. CD-HIT will cluster these transcripts (isotigs). So, after clustering, you should check again how many transcripts (isotigs) remain for each isogroup. In principle, isotigs should cluster only with isotigs from the same isogroup. After clustering, the remaining isotigs from the same isogroup could very well represent real isoforms.
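That post-clustering check could be sketched in Python roughly as follows. It reads the `.clstr` file that CD-HIT writes next to its output, flags clusters whose members come from more than one isogroup, and counts the representative isotigs left per isogroup. The isotig-to-isogroup mapping is assumed to come from your Newbler output, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Member lines in a .clstr file look like:
#   0   900nt, >isotig00001... *
#   1   850nt, >isotig00002... at +/99.10%
MEMBER_RE = re.compile(r">(\S+?)\.\.\.\s+(\*|at\b)")

def parse_clstr(clstr_lines):
    """Return a list of clusters; each is a list of (sequence_id, is_representative)."""
    clusters = []
    for line in clstr_lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            clusters.append([])
        elif line and clusters:
            match = MEMBER_RE.search(line)
            if match:
                clusters[-1].append((match.group(1), match.group(2) == "*"))
    return clusters

def summarize(clusters, isotig_to_isogroup):
    """Return (clusters spanning several isogroups, representatives kept per isogroup)."""
    mixed = []
    remaining = defaultdict(int)
    for members in clusters:
        # Sequences missing from the mapping (e.g. singletons) are ignored here.
        groups = {isotig_to_isogroup[s] for s, _ in members if s in isotig_to_isogroup}
        if len(groups) > 1:
            mixed.append([s for s, _ in members])
        for seq_id, is_rep in members:
            if is_rep and seq_id in isotig_to_isogroup:
                remaining[isotig_to_isogroup[seq_id]] += 1
    return mixed, dict(remaining)
```

Clusters reported as mixed are worth inspecting by hand, and isogroups that still have more than one representative isotig are the ones that may contain real isoforms.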
Applications can be found in the papers that cite CD-HIT:
Li et al. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.
Li et al. (2001) Clustering of highly homologous sequences to reduce the size of large protein database.
Li et al. (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases.
Huang et al. (2010) CD-HIT Suite: a web server for clustering and comparing biological sequences.
Niu et al. (2009) Artificial and natural duplicates in pyrosequencing reads of metagenomic data.