Question

How stringent/loose should clustering be (CD-HIT-EST/USearch) for multi-kmer assemblies?

8.4 years ago

satshil.r ▴ 50

Hello,

I have a de novo assembly built with a multi-kmer approach (k from 21 to 71 in steps of 10) on 8 separate samples. I merged the samples and ran cd-hit-est with a 90% identity cutoff, then ran a second instance with an 80% cutoff. However, my assembly still contains 450k contigs. Is it reasonable to keep lowering the identity threshold in order to get a more "reasonable" transcriptome?

Background on the assembly: 8 samples, each assembled with Bridger using a multi-kmer approach (k from 21 to 71 in steps of 10). All resulting contigs were merged to form a reference (for DGE analysis). CD-HIT-EST was run twice, first with 0.9 identity, then with 0.8. The resulting file still contains 450k contigs. The original non-clustered transcriptome was well over 2M contigs.
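For reference, the two clustering runs described above could be scripted roughly as follows. This is only a sketch: the file names, thread count, and memory limit are hypothetical, and the word-size table is the one suggested in the CD-HIT user guide as best I recall it, so check it against your installed version. The script only builds and prints the commands; it does not run cd-hit-est.

```python
import shlex


def word_size_for(identity):
    """Suggested cd-hit-est word size (-n) for an identity threshold (-c).

    Assumption: ranges follow the CD-HIT user guide for cd-hit-est.
    """
    if not 0.75 <= identity <= 1.0:
        raise ValueError("cd-hit-est expects an identity between 0.75 and 1.0")
    if identity >= 0.95:
        return 10
    if identity >= 0.90:
        return 8
    if identity >= 0.88:
        return 7
    if identity >= 0.85:
        return 6
    if identity >= 0.80:
        return 5
    return 4


def cd_hit_est_command(fasta_in, fasta_out, identity, threads=4, memory_mb=16000):
    """Build the argument list for one cd-hit-est run (nothing is executed)."""
    return [
        "cd-hit-est",
        "-i", fasta_in,                        # input FASTA
        "-o", fasta_out,                       # output FASTA of representatives
        "-c", str(identity),                   # sequence identity threshold
        "-n", str(word_size_for(identity)),    # word size matched to -c
        "-T", str(threads),                    # threads (hypothetical default)
        "-M", str(memory_mb),                  # memory limit in MB (hypothetical default)
    ]


if __name__ == "__main__":
    # Hypothetical file names for the merged multi-kmer assembly.
    for identity in (0.9, 0.8):
        cmd = cd_hit_est_command("merged.fasta", f"merged_c{identity}.fasta", identity)
        print(shlex.join(cmd))
```

Note that the word size has to be lowered together with the identity threshold, which is easy to miss when rerunning at 0.8.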

Thank you

Trinity RNA-Seq sequencing Velvet cd-hit • 3.2k views

updated 20 months ago by Ram 43k • written 8.4 years ago by satshil.r ▴ 50

Is the sequencing for each sample deep enough to get a good assembly? I would concatenate the fastq files and assemble all samples in one run (or at least all samples from the same genetic background, if you are working with model organisms).

Could you post your Bridger command?

8.4 years ago by h.mon 35k

You have to run cd-hit-est at multiple thresholds and check at which point there is a drastic change in the number of resulting contigs. The threshold just above that drastic change would be the better threshold. Having said that, there is no gold-standard limit to set for the tool; you can only choose the seemingly better one.
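One way to make the "drastic change" check concrete is to count the sequences left after each run and look for the largest relative drop between neighbouring thresholds. The sketch below assumes you already have one cd-hit-est output FASTA per threshold; the function names are my own and the counts in the example are made up purely for illustration.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def largest_drop(counts):
    """Find the adjacent pair of thresholds with the largest relative drop.

    counts maps identity threshold -> number of sequences kept (all > 0).
    Returns (higher_threshold, lower_threshold, fraction_lost).
    """
    thresholds = sorted(counts, reverse=True)
    if len(thresholds) < 2:
        raise ValueError("need counts for at least two thresholds")
    best = None
    for higher, lower in zip(thresholds, thresholds[1:]):
        fraction_lost = (counts[higher] - counts[lower]) / counts[higher]
        if best is None or fraction_lost > best[2]:
            best = (higher, lower, fraction_lost)
    return best


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    example = {0.95: 1000, 0.90: 950, 0.85: 900, 0.80: 500}
    higher, lower, fraction_lost = largest_drop(example)
    print(f"Largest drop is between {higher} and {lower} ({fraction_lost:.0%} lost)")
    print(f"Candidate threshold: {higher}")
```

In practice the counts would come from calling `count_fasta_records` on each output file. The largest drop is only a heuristic, so it is worth inspecting what actually gets merged below the chosen threshold.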

ADD REPLY • link 8.3 years ago by Prakki Rama ★ 2.7k