How stringent/loose should clustering be (CD-HIT-EST/USearch) for multi-kmer assemblies?
5.9 years ago
satshil.r ▴ 50

Hello,

I have a de novo assembly built with a multi-k-mer approach (k = 21 to 71, in steps of 10) on 8 separate samples. I merged the samples and ran cd-hit-est with a 90% identity cutoff, then ran it a second time with an 80% cutoff. However, my assembly still contains 450k contigs. Is it reasonable to keep lowering the identity threshold in order to get a more "reasonable" transcriptome?

Background on the assembly: 8 samples, each assembled with Bridger using a multi-k-mer approach (k = 21 to 71, in steps of 10). All resulting contigs were merged to form a reference (for DGE analysis). CD-HIT-EST was run twice, first at 0.9 identity, then at 0.8. The resulting file still contains 450k contigs; the original, unclustered transcriptome had well over 2M contigs.

Thank you.

cd-hit RNA-Seq sequencing Trinity Velvet • 2.6k views

Is the sequencing for each sample deep enough to get a good assembly? I would concatenate the fastq files and assemble all samples in one run (or at least all samples from the same genetic background, if you are working with model organisms).

Could you post your Bridger command?
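The pooling suggestion above could be sketched as follows. This is only an illustration: the function name `pool_reads` and the file names in the usage note are hypothetical, and it assumes all inputs are either plain FASTQ or all gzip-compressed (concatenated gzip files remain a valid gzip stream).

```python
import shutil
from pathlib import Path
from typing import Sequence


def pool_reads(inputs: Sequence[str], output: str) -> int:
    """Append the input read files to `output` in the given order.

    Files are copied byte for byte, so do not mix compressed and
    uncompressed inputs. Returns the total number of bytes copied.
    """
    total = 0
    with open(output, "wb") as out:
        for name in inputs:
            with open(name, "rb") as src:
                # Stream the file in chunks instead of loading it into memory.
                shutil.copyfileobj(src, out)
            total += Path(name).stat().st_size
    return total
```

For paired-end data, call it once for the forward reads and once for the reverse reads, listing the samples in the same order both times so that read pairs stay in sync, e.g. `pool_reads(["s1_R1.fastq", "s2_R1.fastq"], "pooled_R1.fastq")`.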


You need to run cd-hit-est at several identity thresholds and check at which point the result (for example, the number of clusters) changes drastically. The threshold just above that drastic change is the better choice. That said, there is no gold-standard cutoff for this tool; you can only pick the one that seems to work best for your data.
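One way to look for that drastic change is to count the clusters in the `.clstr` file that cd-hit-est writes for each run and compare adjacent thresholds. The sketch below is a minimal illustration, not a prescribed method: the helper names `count_clusters` and `largest_drop` are made up here, and it assumes "drastic change" means the largest fall in cluster count between two neighbouring thresholds.

```python
from typing import Dict, Iterable, Tuple


def count_clusters(clstr_lines: Iterable[str]) -> int:
    """Count clusters in CD-HIT .clstr output.

    Each cluster begins with a header line of the form '>Cluster N';
    member lines start with an index, so they are not counted.
    """
    return sum(1 for line in clstr_lines if line.startswith(">Cluster"))


def largest_drop(counts: Dict[float, int]) -> Tuple[float, float, int]:
    """Find the biggest fall in cluster count between adjacent thresholds.

    `counts` maps identity threshold -> number of clusters.
    Returns (threshold_above, threshold_below, drop). Assumption: the
    'drastic change' is simply the largest absolute drop.
    """
    ordered = sorted(counts, reverse=True)  # highest identity first
    if len(ordered) < 2:
        raise ValueError("need at least two thresholds to compare")
    best = None
    for above, below in zip(ordered, ordered[1:]):
        drop = counts[above] - counts[below]
        if best is None or drop > best[2]:
            best = (above, below, drop)
    return best
```

Usage would be to call `count_clusters(open(path))` on each run's `.clstr` file, collect the counts in a dict keyed by threshold, and pass that to `largest_drop`; the first value returned is the threshold just above the biggest change.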
