Trinity de novo transcript assembly too many transcripts

2

9.1 years ago

Mehmet ▴ 820

Dear All,

I have completed a Trinity de novo assembly for a worm species, using a k-mer size of 25. The assembly seems to contain too many genes and transcripts. How can I remove duplicate copies of the same transcript?

>c4_g1_i1 len=584 path=[53:0-583]
>c5_g1_i1 len=221 path=[47:0-166 213:167-220]
>c6_g1_i1 len=223 path=[735:0-15 737:16-222]
Total trinity 'genes':      29340
Total trinity transcripts:  37318
Total assembled bases:      21926265
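
For readers who want to check counts like these themselves, here is a minimal Python sketch. It assumes the Trinity naming convention visible in the headers above, where the transcript ID ends in an isoform suffix such as `_i1` and everything before that suffix is the 'gene' ID. The function name `summarize_trinity_headers` is made up for this illustration and is not part of Trinity.

```python
import re
from collections import Counter


def summarize_trinity_headers(fasta_lines):
    """Return (number of 'genes', number of transcripts) from FASTA header lines.

    Assumes IDs like c4_g1_i1, where the trailing _i<N> marks the isoform.
    """
    isoforms_per_gene = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        transcript_id = line[1:].split()[0]
        # Strip the isoform suffix to get the 'gene' identifier.
        gene_id = re.sub(r"_i\d+$", "", transcript_id)
        isoforms_per_gene[gene_id] += 1
    return len(isoforms_per_gene), sum(isoforms_per_gene.values())


headers = [
    ">c4_g1_i1 len=584 path=[53:0-583]",
    ">c5_g1_i1 len=221 path=[47:0-166 213:167-220]",
    ">c6_g1_i1 len=223 path=[735:0-15 737:16-222]",
]
genes, transcripts = summarize_trinity_headers(headers)
print(genes, transcripts)
```

To run it on a whole assembly, pass an open file handle instead of the list, e.g. `summarize_trinity_headers(open("Trinity.fasta"))` (the file name is an assumption).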

RNA-Seq Assembly • 4.0k views

ADD COMMENT • link updated 24 months ago by Ram 44k • written 9.1 years ago by Mehmet ▴ 820

1

Well, actually I don't think that number is too big. I've used Trinity for a while, and it's normal to get more genes and transcripts than you expected. I would cluster the transcripts by sequence identity to reduce the overall size of the transcriptome without losing sequence information, since only 'redundant' (highly similar) sequences are removed. It could also be a good idea to do some contaminant filtering with BLAST (for example, to remove transcripts that have human hits).
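
The contaminant-filtering idea can be sketched in Python roughly as follows. This is only an illustration under assumptions: it supposes the transcripts were already searched with BLAST against a contaminant-only database (for example human) and the results saved in tabular format (`-outfmt 6`), where the first column is the query ID. The function names and the toy sequences are hypothetical, and in practice you would probably also filter hits by e-value or identity before discarding anything.

```python
def ids_with_hits(blast_tab_lines):
    """Collect query IDs (column 1) from BLAST tabular output lines."""
    hits = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hits.add(line.split("\t")[0])
    return hits


def drop_flagged(fasta_lines, flagged_ids):
    """Return FASTA lines for records whose ID is not in flagged_ids."""
    kept = []
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Decide once per record, based on the ID in the header.
            keep = line[1:].split()[0] not in flagged_ids
        if keep:
            kept.append(line)
    return kept


# Toy data: made-up sequences and a made-up BLAST hit, for illustration only.
fasta = [
    ">c4_g1_i1 len=8", "ACGTACGT",
    ">c5_g1_i1 len=8", "TTGACCAA",
]
blast = ["c5_g1_i1\thuman_seq_1\t99.0\t8\t0\t0\t1\t8\t1\t8\t1e-50\t100"]

clean = drop_flagged(fasta, ids_with_hits(blast))
print("\n".join(clean))
```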

ADD REPLY • link 9.1 years ago by iraun 6.2k

0

Thank you for your advice. Do you mind if I ask how to cluster the transcripts? I mean, which tools or scripts can be used?

ADD REPLY • link 9.1 years ago by Mehmet ▴ 820

3

There are other tools, but I usually do the clustering with cd-hit.

ADD REPLY • link updated 5.3 years ago by Ram 44k • written 9.1 years ago by iraun 6.2k

1

cd-hit-est, to be specific, no?

ADD REPLY • link 9.1 years ago by Ram 44k

0

Yes, exactly :)
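
For anyone landing here later, a cd-hit-est run for this purpose might look roughly like the Python sketch below. The thread does not give any parameters, so everything here is an assumption: the file names, the 95% identity threshold (`-c 0.95`), the word size (`-n 10`), the thread count (`-T`) and the memory limit in MB (`-M`). The helper `cd_hit_est_command` is hypothetical and only builds the command line; the actual run happens only if `cd-hit-est` is on the PATH and the input file exists.

```python
import os
import shutil
import subprocess


def cd_hit_est_command(fasta_in, fasta_out, identity=0.95, word_size=10,
                       threads=4, memory_mb=8000):
    """Build a cd-hit-est command line as a list of arguments."""
    return [
        "cd-hit-est",
        "-i", fasta_in,         # input transcripts (FASTA)
        "-o", fasta_out,        # output: one representative per cluster
        "-c", str(identity),    # sequence identity threshold
        "-n", str(word_size),   # word size, should suit the chosen threshold
        "-T", str(threads),     # number of threads
        "-M", str(memory_mb),   # memory limit in MB
    ]


# Assumed file names; adjust to your own assembly.
cmd = cd_hit_est_command("Trinity.fasta", "Trinity.cdhit95.fasta")
print(" ".join(cmd))

# Only run when the tool and the input file are actually available.
if shutil.which("cd-hit-est") and os.path.exists("Trinity.fasta"):
    subprocess.run(cmd, check=True)
```

A lower identity threshold collapses more transcripts but may merge real isoforms or paralogs, so it is worth comparing transcript counts before and after clustering.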

ADD REPLY • link 9.1 years ago by iraun 6.2k