I have a de novo assembly built with a multi-kmer approach (k = 21 to 71 in steps of 10) on 8 separate samples. I merged the samples and ran cd-hit-est with a 90% identity cutoff, then ran it a second time with an 80% cutoff. However, my assembly still contains 450k contigs. Is it reasonable to keep lowering the identity threshold in order to get a more "reasonable" transcriptome?
Background on the assembly: 8 samples, each assembled with Bridger using a multi-kmer approach (k = 21 to 71 in steps of 10). All resulting contigs were merged to form a reference (for DGE analysis). CD-HIT-EST was run twice, first with 0.9 identity, then with 0.8. The resulting file still contains 450k contigs. The original non-clustered transcriptome had well over 2M contigs.
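In case it helps anyone answering, here is a rough sketch of how the clustering result could be summarized before deciding whether a lower threshold is worth trying. It assumes the usual CD-HIT `.clstr` layout (a `>Cluster N` header followed by one line per member sequence); the contig names and the helper names `cluster_sizes` and `summarize` are made up for illustration and are not part of my actual pipeline.

```python
from typing import Dict, Iterable, List


def cluster_sizes(clstr_lines: Iterable[str]) -> List[int]:
    """Return the number of member sequences in each cluster.

    Assumes CD-HIT style .clstr text: each cluster starts with a
    '>Cluster N' header, and every following non-empty line is one member.
    """
    sizes: List[int] = []
    for raw in clstr_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            sizes.append(0)      # open a new cluster
        elif sizes:
            sizes[-1] += 1       # count one member of the current cluster
    return sizes


def summarize(sizes: List[int]) -> Dict[str, int]:
    """Basic counts that show how much the clustering actually collapsed."""
    return {
        "clusters": len(sizes),
        "sequences": sum(sizes),
        "singletons": sum(1 for s in sizes if s == 1),
        "largest": max(sizes, default=0),
    }


# Tiny hypothetical example; with real data, pass an open .clstr file instead.
example = [
    ">Cluster 0",
    "0\t1500nt, >contigA... *",
    "1\t1400nt, >contigB... at +/92.10%",
    ">Cluster 1",
    "0\t800nt, >contigC... *",
]
summary = summarize(cluster_sizes(example))
print(summary)
```

If most of the 450k clusters turn out to be singletons, lowering the identity cutoff further would probably not collapse much more. Note that if the second run took the first run's output as its input, its `.clstr` file only describes that second round.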