Hi everybody,
I have fairly high-quality Illumina reads that were trimmed and assembled with CLC Genomics Workbench. I tried several k-mer sizes and got the best assembly at a k-mer size of 64, judged by basic metrics such as N50, the number of contigs, and the percentage of reads that map back to the assembly (about 90%). I then ran this assembly (44.8 MB) through cd-hit with a strict 100% identity threshold, since plants have many paralogues with high sequence identity, but the size of the output file did not change. I also tried an identity threshold of 0.9, and the output file (43.5 MB) was not significantly smaller than the original input (44.8 MB). Could anybody please tell me whether these results are normal or whether something is wrong? Thanks for any comments.
Why would you expect to have identical contigs in your assembly? The point of an assembler is to collapse and extend overlapping sequences.
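If you want to confirm this on your own contigs, a minimal sketch like the one below counts how many contigs are exact full-length duplicates of an earlier contig on either strand. The function names (`read_fasta`, `reverse_complement`, `count_redundant`) are made up for illustration, and this is only a rough sanity check, not a reimplementation of cd-hit: it ignores contigs that are contained inside longer ones, which a clustering tool would also remove.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Translation table for complementing A/C/G/T/N (other symbols are left as-is).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_redundant(records):
    """Count contigs identical to an earlier contig on either strand."""
    seen = set()
    redundant = 0
    for _, seq in records:
        # Use the lexicographically smaller strand as a strand-independent key.
        canonical = min(seq, reverse_complement(seq))
        if canonical in seen:
            redundant += 1
        else:
            seen.add(canonical)
    return redundant
```

With a real file you would call something like `count_redundant(read_fasta(open("contigs.fa")))`, where `contigs.fa` is a placeholder for your assembly. For a typical assembler output I would expect the count to be zero or close to it, which matches a 100% identity run leaving the file size unchanged.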