I have a de novo assembly built with a multi-kmer approach (k = 21 to 71 in steps of 10) on 8 separate samples. I merged the samples and ran cd-hit-est with a 90% identity cutoff, then ran a second instance with an 80% cutoff. However, my assembly still contains 450k contigs. Is it reasonable to keep lowering the identity threshold in order to get a more "reasonable" transcriptome?
Background on the assembly: 8 samples, each assembled with Bridger using a multi-kmer approach (k = 21 to 71 in steps of 10). All resulting contigs were merged to form a reference (for DGE analysis). CD-HIT-EST was run twice, first with 0.9 identity, then with 0.8. The resulting file still contains 450k contigs. The original non-clustered transcriptome had well over 2M contigs.
Is the sequencing for each sample deep enough to get a good assembly? I would concatenate the fastq files and assemble all samples in one run (or at least all samples from the same genetic background, if you are working with model organisms).
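If it helps, here is a minimal sketch of the concatenation step in Python. The file names are hypothetical placeholders, and I am assuming your reads are either all plain FASTQ or all gzip-compressed (gzip files can be joined byte for byte and still decompress as one stream). For paired-end data, list the samples in the same order for R1 and R2 so the mates stay in sync.

```python
import shutil


def concatenate(inputs, output):
    """Append each input file to `output` byte for byte, in the given order."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


if __name__ == "__main__":
    # Hypothetical sample names; replace with your own files.
    samples = [f"sample{i}" for i in range(1, 9)]
    concatenate([f"{s}_R1.fastq.gz" for s in samples], "all_R1.fastq.gz")
    concatenate([f"{s}_R2.fastq.gz" for s in samples], "all_R2.fastq.gz")
```

The pooled files can then go into a single assembly run instead of eight separate ones.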
Could you post your Bridger command?
You should run cd-hit-est at several thresholds and check at which point the number of contigs changes drastically. The threshold just above that drastic change would be the better choice. Having said that, there is no gold-standard threshold for this tool; you can only pick the one that seems best for your data.
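A rough sketch of that threshold scan in Python, in case it is useful. It assumes `cd-hit-est` is on your PATH and that the merged assembly is in a file I am calling `merged.fasta` (a placeholder). The word sizes (`-n`) follow the ranges suggested in the CD-HIT user guide as I remember them, so please check them against your version. The "drastic change" rule here is just the largest drop in contig count between neighbouring thresholds, which is my own simplification.

```python
import subprocess


def word_size(identity):
    """Word size (-n) for cd-hit-est at a given identity (assumed from the user guide)."""
    if identity >= 0.95:
        return 10
    if identity >= 0.90:
        return 8
    if identity >= 0.88:
        return 7
    if identity >= 0.85:
        return 6
    if identity >= 0.80:
        return 5
    return 4


def build_command(fasta, identity, out_prefix="clustered"):
    """Build the cd-hit-est command line for one identity threshold."""
    out = f"{out_prefix}_{identity:.2f}.fasta"
    return ["cd-hit-est", "-i", fasta, "-o", out,
            "-c", f"{identity:.2f}", "-n", str(word_size(identity)),
            "-T", "0", "-M", "0"]  # 0 = all CPUs, no memory limit


def count_contigs(fasta):
    """Count FASTA records by counting header lines."""
    with open(fasta) as fh:
        return sum(1 for line in fh if line.startswith(">"))


def threshold_before_largest_drop(counts):
    """Return the threshold just above the largest drop in contig count.

    `counts` maps identity threshold -> number of contigs and needs
    at least two thresholds.
    """
    ordered = sorted(counts, reverse=True)
    drops = [(counts[hi] - counts[lo], hi) for hi, lo in zip(ordered, ordered[1:])]
    return max(drops)[1]


if __name__ == "__main__":
    counts = {}
    for identity in (0.95, 0.90, 0.85, 0.80):
        cmd = build_command("merged.fasta", identity)
        subprocess.run(cmd, check=True)
        counts[identity] = count_contigs(cmd[4])  # cmd[4] is the -o output file
    print(counts)
    print("suggested threshold:", threshold_before_largest_drop(counts))
```

Each threshold is run on the original merged file rather than on the output of the previous run, so the counts are directly comparable. Treat the suggested value as a starting point and look at the counts yourself before settling on one.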