Question: De novo transcriptome assembly produces too many transcripts
6 months ago by
21afiq10 wrote:

Hi, I just finished my transcriptome assembly using Trinity. However, Trinity produced too many transcripts (~300k), which is not normal for my sample. I believe most of these transcripts are redundant. How can I remove the redundant transcripts?

1) I already tried CD-HIT-EST. Unfortunately, the output still contains many redundant transcripts.

2) I also tried Corset and followed a tutorial. However, I am currently stuck on how to recover the unigene sequences from the Corset output.

3) I plan to try TGICL to further remove redundant sequences from the CD-HIT output, as some studies have done. However, I am not very familiar with TGICL and don't know which parameters to use.

I would be grateful if somebody could help with my problem. Thanks.

modified 6 months ago by Corentin320 • written 6 months ago by 21afiq10

Which organism are you working with?

written 6 months ago by kristoffer.vittingseerup2.0k

I always find it helpful to map the transcripts to a reference genome and view them in a genome browser. I find GMAP to be the best mapper. Example command (might be out of date):

gmap -f gff3_gene -D /lager2/rcug/seqres/HS/gmap/hg19_gmap -d hg19_gmap -B 5 -t 16 --intronlength=150000 --totallength=1000000 --npaths 1 -p 3 in.fa > in.fa.gff3

written 6 months ago by colindaven1.6k
6 months ago by
Corentin320 wrote:

The Trinity FAQ states that having a lot of transcripts is expected (I would advise you to read it if you have not already):

Lots of transcripts is the rule rather than the exception.

If you are still concerned about the number of transcripts, you can filter them based on their abundance. I usually filter out transcripts that have a very low expression level in all the samples. They sometimes correspond to artifacts, but you also risk removing important transcripts that are simply expressed at low levels. From the FAQ again:

Biological relevance of the lowly expressed transcripts could be questionable - some are bound to be very relevant.
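A minimal sketch of this kind of abundance filter, in Python. It assumes a tab-separated expression matrix with a header row, transcript IDs in the first column, and one TPM column per sample; the function name, the example IDs and values, and the 1.0 TPM cutoff are all illustrative assumptions, not values recommended by Trinity.

```python
import csv
import io


def filter_low_expression(matrix_lines, min_tpm=1.0):
    """Return IDs of transcripts reaching min_tpm in at least one sample.

    matrix_lines: iterable of tab-separated lines (header row first,
    transcript ID in column 1, one TPM value per sample after that).
    """
    reader = csv.reader(matrix_lines, delimiter="\t")
    next(reader)  # skip the header row
    kept = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        transcript_id = row[0]
        tpm_values = [float(v) for v in row[1:]]
        # Keep the transcript if it is expressed in ANY sample;
        # only transcripts that are low in ALL samples are dropped.
        if max(tpm_values) >= min_tpm:
            kept.append(transcript_id)
    return kept


# Made-up example matrix (hypothetical IDs and values).
example = (
    "transcript\tsampleA\tsampleB\n"
    "TRINITY_DN1_c0_g1_i1\t12.5\t8.1\n"
    "TRINITY_DN2_c0_g1_i1\t0.2\t0.0\n"
    "TRINITY_DN3_c0_g1_i1\t0.4\t3.0\n"
)

print(filter_low_expression(io.StringIO(example), min_tpm=1.0))
```

With a real matrix you would pass an open file handle instead of the in-memory example. Using the maximum across samples, rather than the mean, avoids discarding transcripts that are expressed in only one condition.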

I wrote a Python script that prints the number of transcripts against the expression level (it only works on Linux). This can help you find the best threshold.
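A rough sketch of the same idea (this is not the script mentioned above): for a series of candidate cutoffs, count how many transcripts would survive. It assumes you already have, for each transcript, its highest TPM across all samples; the function name, the values and the list of thresholds are made up for illustration.

```python
def count_by_threshold(max_tpm_per_transcript, thresholds):
    """For each threshold, count transcripts whose max TPM is >= threshold."""
    return [
        (t, sum(1 for v in max_tpm_per_transcript if v >= t))
        for t in thresholds
    ]


# Hypothetical per-transcript maximum TPM values.
max_tpm = [12.5, 0.2, 3.0, 0.05, 1.0, 45.0]

for threshold, n_kept in count_by_threshold(max_tpm, [0.0, 0.1, 1.0, 5.0, 10.0]):
    print(f"TPM >= {threshold:g}: {n_kept} transcripts")
```

Looking at where the counts stop dropping sharply as the cutoff increases is one way to pick a threshold, though the choice remains a judgment call.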

written 6 months ago by Corentin320