How stringent/loose should clustering be (CD-HIT-EST/USearch) for multi-kmer assemblies?
5.9 years ago
satshil.r ▴ 50


I have a de novo assembly built with a multi-kmer approach (k = 21 to 71 in steps of 10) on 8 separate samples. I merged the samples and ran cd-hit-est with a 90% identity cutoff, then ran a second instance with an 80% cutoff. However, my assembly still contains 450k contigs. Is it reasonable to keep lowering the identity threshold in order to get a more "reasonable" transcriptome?

Background on the assembly: 8 samples, each assembled with Bridger using a multi-kmer approach (k = 21 to 71 in steps of 10). All resulting contigs were merged to form a reference (for DGE analysis). CD-HIT-EST was run twice, first with 0.9 identity, then with 0.8. The resulting file still contains 450k contigs. The original, non-clustered transcriptome had well over 2M contigs.


Thank you.

cd-hit RNA-Seq sequencing Trinity Velvet • 2.6k views

Is the sequencing for each sample deep enough to get a good assembly? I would concatenate the fastq files and assemble all samples in one run (or at least all samples from the same genetic background, if you are working with model organisms).

Could you post your Bridger command?
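The pooling suggested above (concatenating the FASTQ files before a single assembly run) could look roughly like the sketch below. The function name `pool_fastq` and the file names in the usage note are hypothetical, not taken from the original post, and the sketch assumes all inputs share the same format and quality encoding.

```python
import shutil
from pathlib import Path


def pool_fastq(inputs, output):
    """Concatenate FASTQ files byte-for-byte, in the order given.

    Assumes every input ends with a newline, as well-formed FASTQ does.
    """
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                # Stream the file in chunks so large inputs are never
                # held in memory all at once.
                shutil.copyfileobj(src, out)
    return Path(output)
```

For paired-end data, the R1 files and the R2 files would each be pooled separately, listing the samples in the same order both times so that mates stay in sync, e.g. `pool_fastq(["s1_R1.fastq", "s2_R1.fastq"], "pooled_R1.fastq")`.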


You have to run cd-hit-est at multiple thresholds and check at which point there is a drastic change in the number of remaining contigs. The threshold just above that drastic change would be the better threshold. That said, there is no gold-standard cutoff for this tool; you can only choose the one that seems best.
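One way to make the "drastic change" check above concrete is sketched here. It assumes cd-hit-est has already been run at each identity threshold and that the number of representative contigs per run has been counted from the output FASTA. The helper names are hypothetical, and using the largest relative drop between consecutive thresholds as the definition of "drastic" is an assumption of this sketch, not something the tool prescribes.

```python
def count_contigs(fasta_path):
    """Count sequences in a FASTA file by counting header lines."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def threshold_before_largest_drop(counts):
    """Return (threshold, relative_drop) for the steepest step.

    counts maps identity threshold -> number of clusters at that threshold.
    The returned threshold is the one just ABOVE the largest relative drop.
    """
    ordered = sorted(counts.items(), reverse=True)  # high identity first
    if len(ordered) < 2:
        raise ValueError("need at least two thresholds to compare")
    best_threshold, best_drop = None, -1.0
    for (hi_t, hi_n), (_lo_t, lo_n) in zip(ordered, ordered[1:]):
        if hi_n == 0:
            continue  # nothing left to lose at this step
        drop = (hi_n - lo_n) / hi_n
        if drop > best_drop:
            best_threshold, best_drop = hi_t, drop
    return best_threshold, best_drop


# Toy numbers for illustration only; they are NOT from the assembly above.
example = {0.95: 1000, 0.90: 950, 0.85: 900, 0.80: 400, 0.75: 380}
print(threshold_before_largest_drop(example))
```

With these toy counts the steepest step is between 0.85 and 0.80, so 0.85 is reported. On real data it is still worth plotting the counts against the thresholds and looking at the curve rather than relying on a single number.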

