7.8 years ago
Mehmet
Dear All,
I have completed a Trinity de novo assembly for a worm species, using a k-mer size of 25. The assembly contains more genes and transcripts than I expected. How can I remove duplicate copies of the same transcript?
>c4_g1_i1 len=584 path=[53:0-583]
>c5_g1_i1 len=221 path=[47:0-166 213:167-220]
>c6_g1_i1 len=223 path=[735:0-15 737:16-222]
Total trinity 'genes': 29340
Total trinity transcripts: 37318
Total assembled bases: 21926265
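For a quick look at how many isoforms each Trinity 'gene' carries, the headers can be tallied directly. This is only a minimal sketch: it assumes the IDs follow the `c<N>_g<N>_i<N>` pattern seen in the headers above, and the function name `isoforms_per_gene` is made up for illustration. For reference, the totals above work out to roughly 37318 / 29340 ≈ 1.27 transcripts per gene.

```python
import re
from collections import Counter

# Assumption: IDs look like "c4_g1_i1", where everything before "_i<N>"
# is the Trinity 'gene' and the "_i<N>" suffix marks the isoform.
TRINITY_ID = re.compile(r"^>(?P<gene>\S+_g\d+)_i\d+")


def isoforms_per_gene(header_lines):
    """Count transcripts (isoforms) for each Trinity 'gene' prefix."""
    counts = Counter()
    for line in header_lines:
        match = TRINITY_ID.match(line)
        if match:  # lines that are not Trinity-style headers are skipped
            counts[match.group("gene")] += 1
    return counts


# The three headers quoted in the question.
headers = [
    ">c4_g1_i1 len=584 path=[53:0-583]",
    ">c5_g1_i1 len=221 path=[47:0-166 213:167-220]",
    ">c6_g1_i1 len=223 path=[735:0-15 737:16-222]",
]

counts = isoforms_per_gene(headers)
print(len(counts), "genes;", sum(counts.values()), "transcripts")
```

On a full assembly you would pass the lines of the FASTA file instead of the hard-coded list; genes with many isoforms are the ones where clustering is most likely to collapse something.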
Actually, I don't think that number is too big. I've used Trinity for a while, and it's normal to get more genes and transcripts than you expect. I would cluster the transcripts by sequence identity: this reduces the overall size of the transcriptome by removing only redundant (highly similar) sequences, without discarding distinct sequence information. It could also be a good idea to filter contaminants with BLAST (for example, to remove transcripts that have human hits).
Thank you for your advice. Do you mind if I ask how to cluster the transcripts? Which tools or scripts can be used?
There are other tools, but I usually do the clustering with cd-hit.
cd-hit-est, to be specific, no?
Yes, exactly :)
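For anyone landing here later, a cd-hit-est run along these lines could be wrapped as below. Treat it as a sketch under assumptions, not a recommended setting: the file names `Trinity.fasta` and `Trinity.cdhit95.fasta`, the 95% identity threshold, and the helper name `cd_hit_est_command` are illustrative and were not given in this thread. The flags used are `-i`/`-o` (input/output), `-c` (identity threshold), `-n` (word size, which has to match the threshold), `-T` (threads) and `-M` (memory limit in MB, 0 for unlimited).

```python
import os
import shutil
import subprocess


def cd_hit_est_command(fasta_in, fasta_out, identity=0.95, threads=4):
    """Build a cd-hit-est argument list for a given identity threshold."""
    # cd-hit-est expects the word size (-n) to suit the threshold:
    # 10 for 0.95-1.0 and 8 for 0.90-0.95. Lower thresholds need smaller
    # word sizes and are deliberately not handled in this sketch.
    if not 0.90 <= identity <= 1.0:
        raise ValueError("this sketch only covers identity in [0.90, 1.0]")
    word_size = 10 if identity >= 0.95 else 8
    return [
        "cd-hit-est",
        "-i", fasta_in,
        "-o", fasta_out,
        "-c", str(identity),
        "-n", str(word_size),
        "-T", str(threads),
        "-M", "0",
    ]


if __name__ == "__main__":
    # Hypothetical file names; adjust to your own assembly output.
    cmd = cd_hit_est_command("Trinity.fasta", "Trinity.cdhit95.fasta")
    print(" ".join(cmd))
    # Only run when the tool and the input file are actually available.
    if shutil.which("cd-hit-est") and os.path.exists("Trinity.fasta"):
        subprocess.run(cmd, check=True)
```

The clustered FASTA keeps one representative sequence per cluster, and the accompanying `.clstr` file lists which transcripts were grouped together, so it is worth checking that file before discarding anything.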