Question

Why could the cd-hit-est tool not recognize identical contigs in the assembled transcriptome input file?


9.2 years ago

seta ★ 1.9k

Hi everybody,

I have fairly high-quality Illumina reads that were trimmed and assembled with CLC Genomics Workbench. I tried several k-mer sizes and got the best assembly at a k-mer size of 64, judged by basic metrics such as N50, the number of contigs, and the percentage of reads mapping back (about 90%). I ran this assembly (44.8 MB) through cd-hit with a strict 100% identity threshold, since plants have many paralogues with high sequence identity, but the size of the output file did not change. I also tried an identity threshold of 0.9, and the output file (43.5 MB) was not significantly smaller than the original input (44.8 MB). Could anybody please let me know whether these results are usual, or whether something is wrong? Thanks for any comments.

RNA-Seq Assembly blast • 2.2k views

updated 23 months ago by Ram 43k • written 9.2 years ago by seta ★ 1.9k


Why would you expect to have identical contigs in your assembly? The point of an assembler is to collapse and extend overlapping sequences.

updated 23 months ago by Ram 43k • written 9.2 years ago by Istvan Albert 100k

Ram · Answer 1 · 2015-02-28


9.2 years ago

seta ★ 1.9k

I have no experience in this field and expected some redundancy based only on similar published work. So this is normal, in your view?

updated 23 months ago by Ram 43k • written 9.2 years ago by seta ★ 1.9k


I don't think any assembler should ever generate exactly identical contigs. I can understand how similar contigs could be present but it would not be reasonable to expect identical contigs.

9.2 years ago by Istvan Albert 100k
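The question of whether an assembly contains any exactly identical contigs can be checked directly, independently of cd-hit. The following is a minimal Python sketch, not part of the original thread: the function names (`read_fasta`, `canonical`, `count_exact_duplicates`) and the file name in the final comment are hypothetical. It only counts full-length exact duplicates on either strand; it does not detect short contigs contained within longer ones, which a clustering tool may also collapse.

```python
from collections import Counter

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def canonical(seq):
    """Return a strand-independent key: the smaller of seq and its reverse complement."""
    seq = seq.upper()
    reverse_complement = seq.translate(_COMPLEMENT)[::-1]
    return min(seq, reverse_complement)


def count_exact_duplicates(lines):
    """Count contigs that exactly duplicate another contig on either strand."""
    counts = Counter(canonical(seq) for _, seq in read_fasta(lines))
    return sum(n - 1 for n in counts.values())


# Hypothetical usage on an assembly file:
# with open("contigs.fa") as handle:
#     print(count_exact_duplicates(handle))
```

If this reports zero for the assembly, an unchanged output size at 100% identity would be consistent with the comments above, rather than a sign that the tool failed.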