Question: Trinity Transcriptome Assembly
1
upendrakumar.devisetty (370), United States, wrote 7.8 years ago:

After the Trinity assembler finished, I calculated the basic statistics of the assembly, shown below:

File:  Trinity.fasta
Number:  158863
Total size:  176660784
Min size:  201
Max size:  22887
Average size:  1112.03
Median size:  665
N50:  1863
size @ 1Mbp:  11440  
Number @ 1Mbp:  65
size @ 2Mbp:  8461
Number @ 2Mbp:  170
size @ 4Mbp:  7088
Number @ 4Mbp:  430
size @ 10Mbp:  5424
Number @ 10Mbp:  1417

My question is: do these values look reasonable? Although the N50 looks good, I am worried about the number of transcripts shorter than 1 kb (~60% of all transcripts). Is this normal for Trinity?
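For anyone who wants to reproduce the two numbers discussed here (N50 and the share of transcripts under 1 kb), the following is a minimal sketch in plain Python. The function names are made up for illustration, and the tiny in-memory FASTA stands in for a real file such as Trinity.fasta.

```python
def read_lengths(fasta_lines):
    """Yield the length of each sequence from an iterable of FASTA lines."""
    length = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def fraction_shorter_than(lengths, cutoff=1000):
    """Fraction of sequences strictly shorter than `cutoff` bases."""
    lengths = list(lengths)
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n < cutoff) / len(lengths)


# Toy input; for a real assembly, pass an open file handle instead of this list.
example = [">t1", "ACGTACGT", ">t2", "ACG", ">t3", "ACGTA"]
lengths = list(read_lengths(example))
print(lengths, n50(lengths), fraction_shorter_than(lengths, cutoff=5))
```

Comparing the fraction below 1 kb against the same figure for a reference transcriptome of a related species gives a rough baseline for what is "normal".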

Also, how do people normally select the best transcripts for downstream analysis after getting the assembly? I ask because the number of transcripts is much higher than the expected number of genes in related species.

Thanks

trinity denovo transcriptome • 8.6k views

Please fix formatting, it's very difficult to read the tables.

written 7.8 years ago by Ketil (4.0k)
5
Ketil (4.0k), Germany, wrote 7.8 years ago:

Is it reasonable as transcript assembler output? Possibly. Is it reasonable as an estimate of the real genes? Probably not. When doing transcript assembly, you invariably get a lot of junk: fragmented genes, merged genes, non-coding transcript fragments ("junk" RNA). It's hard to tell from sizes and numbers alone, since these vary with species.

Did you compare to any other tools? Did you map the reads back to the transcripts, and count pairs and mapping percentages? Did you map transcripts to related transcriptomes to estimate errors and coverage? These are all things you can do to evaluate the assembly.

Finally, it's difficult to suggest how to select the "best transcripts", since it's not clear what you mean by "best". Are you looking for something in particular?


Thanks, Ketil, for your response. I have redone the table, and it is the best I can do. I mapped all my Trinity transcripts to my reference genome, and more than 99% of them map. When I BLASTed my Trinity transcripts against a reference transcriptome (from a different accession), I got ~84%, so it looks like they are all genuine. But the problem now is how to deal with the 60% of transcripts that are shorter than 1 kb. How do I know whether they are fragmented?

written 7.7 years ago by upendrakumar.devisetty (370)
1

I use my own tools (first asmeval, then bamstats, both linked from http://blog.malde.org/ ) to map reads to transcripts and then calculate various statistics. One recent addition to the latter is to count "splits", that is, read pairs that span chromosomes (or rather contigs, or in this case putative transcripts). But you can probably do this easily with your own tools; it's not rocket surgery.
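As a rough illustration of the "splits" idea, here is a minimal sketch that counts read pairs whose mates map to different transcripts in a SAM file. It is not how bamstats is implemented; it only assumes standard SAM columns (FLAG in column 2, RNAME in column 3, RNEXT in column 7), and the function name is made up for this example.

```python
def count_split_pairs(sam_lines):
    """Return (pairs with both mates mapped, pairs whose mates hit different references)."""
    mapped_pairs = 0
    split_pairs = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        rname, rnext = fields[2], fields[6]
        paired = flag & 0x1
        first_in_pair = flag & 0x40
        read_or_mate_unmapped = flag & (0x4 | 0x8)
        secondary_or_supplementary = flag & (0x100 | 0x800)
        # Count each pair once, via its first mate, using primary alignments only.
        if not paired or not first_in_pair:
            continue
        if read_or_mate_unmapped or secondary_or_supplementary:
            continue
        mapped_pairs += 1
        # "=" in RNEXT means the mate maps to the same reference as this read.
        if rnext != "=" and rnext != rname:
            split_pairs += 1
    return mapped_pairs, split_pairs


# Toy records: one proper pair on contigA, one pair split between contigA and contigB.
records = [
    "@HD\tVN:1.6",
    "r1\t65\tcontigA\t10\t60\t4M\t=\t50\t44\tACGT\tIIII",
    "r1\t129\tcontigA\t50\t60\t4M\t=\t10\t-44\tACGT\tIIII",
    "r2\t65\tcontigA\t10\t60\t4M\tcontigB\t7\t0\tACGT\tIIII",
    "r2\t129\tcontigB\t7\t60\t4M\tcontigA\t10\t0\tACGT\tIIII",
]
print(count_split_pairs(records))
```

A high proportion of split pairs would suggest that many transcripts are fragments of the same gene; what counts as "high" depends on the data set.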

modified 7.7 years ago • written 7.7 years ago by Ketil (4.0k)

The link to the blog is not working. Can you please fix it? Thanks.

written 7.7 years ago by upendrakumar.devisetty (370)

Sorry! Fixed now - the markup parser had included the end parenthesis in the URL :-) Thanks for pointing it out!

modified 7.7 years ago • written 7.7 years ago by Ketil (4.0k)

+1 for "it's not rocket surgery" :)

written 6.9 years ago by Eric Normandeau (10k)
2
William (4.7k), Europe, wrote 6.9 years ago:

Here is a presentation about different metrics you can (and should) use to assess the quality of your transcriptome assembly.

http://www.abrf.org/Committees/Education/Activities/ABRF2013_SW1_oneil_DeNovo-transccript-Assembly.pdf


interesting ways to show metrics...

written 6.9 years ago by Rm (8.0k)
1
Biojl (1.7k), Barcelona, wrote 7.7 years ago:

You should expect to get a lot more transcripts than genes. Along with what Ketil said, you must also take alternative splicing into account. If you're working with a eukaryotic species and mapping to a decent genome assembly (human, mouse, etc.), you should expect to find different isoforms coming from the same gene. Try calculating how many unique IDs you hit for genes and for transcripts.

For the short transcripts, you could filter for an ORF longer than, say, 100 bp, or whatever value best fits your data.
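A minimal sketch of such an ORF-length filter is below. It assumes an ORF means an ATG followed by an in-frame stop codon (complete ORFs only), searched in all six frames; dedicated ORF finders make different choices, and the function names and the 100 bp default here are purely illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_one_strand(seq):
    """Longest ATG..stop span (in bp, stop codon included) over the three forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def longest_orf(seq):
    """Longest complete ORF in bp over all six reading frames."""
    return max(longest_orf_one_strand(seq),
               longest_orf_one_strand(reverse_complement(seq)))


def filter_by_orf(records, min_orf_bp=100):
    """Keep transcripts (name -> sequence) whose longest ORF exceeds min_orf_bp."""
    return {name: seq for name, seq in records.items()
            if longest_orf(seq) > min_orf_bp}


# Toy transcripts: one with a 126 bp ORF, one with only a 9 bp ORF.
toy = {"long": "ATG" + "AAA" * 40 + "TAA", "short": "CCATGAAATAGCC"}
print(sorted(filter_by_orf(toy)))
```

Requiring a complete ORF will discard genuine but truncated transcripts, so a looser definition (for example, allowing ORFs that run off the end of the sequence) may suit fragmented assemblies better.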


Thanks to both of you for your suggestions and comments. Over the last few days I have learnt a lot about the Trinity output file by going through forums (the Trinity website explains little about it). As said above, I figured out that my Trinity output contains lots of isoforms; I even found one gene with 202 isoforms, so the number of transcripts is not surprising at all. Also, when I checked my reference transcriptome, I found that ~52% of genes are shorter than 1 kb, so I am pretty happy with my Trinity assembly.

I plan to do something like this with the Trinity output to select the best transcripts (copying this pipeline from another forum):
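One way to see how many isoforms each gene has is to group the FASTA headers by their gene-level prefix. The sketch below assumes transcript IDs end in an isoform suffix such as `_seq1` (as in IDs like `comp12_c0_seq1`) or `_i1`; that naming convention is an assumption and should be checked against the headers in your own Trinity.fasta. The function names are illustrative.

```python
import re
from collections import Counter

# Assumed isoform suffixes: "_seqN" or "_iN" at the end of the transcript ID.
ISOFORM_SUFFIX = re.compile(r"_(?:seq|i)\d+$")


def gene_id(transcript_id):
    """Strip the isoform suffix to get the gene-level ID."""
    return ISOFORM_SUFFIX.sub("", transcript_id)


def isoforms_per_gene(fasta_lines):
    """Count transcripts per gene-level ID from FASTA header lines."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:  # skip empty headers
                counts[gene_id(parts[0])] += 1
    return counts


# Toy headers; for a real assembly, pass an open file handle instead.
headers = [">comp1_c0_seq1 len=300", "ACGT",
           ">comp1_c0_seq2 len=400", "ACGT",
           ">comp2_c0_seq1 len=250", "ACGT"]
counts = isoforms_per_gene(headers)
print(len(counts), "genes;", counts.most_common(1))
```

The number of distinct gene-level IDs is usually a fairer figure to compare with gene counts from related species than the raw transcript count.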

  1. expression based: after running the abundance estimation, retain those that have some minimum FPKM value (such as 1).

  2. run the ORF extraction pipeline included in Trinity (don't restrict it to complete ORFs, get both complete and partial ones) - retain those that encode long ORFs (e.g. 200 aa)

  3. blastx the Trinity transcripts against uniref90, retain those that have homology to known proteins (E<=1e-10)

Take the union of {1,2,3} above and call it 'best'.
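The union of the three filters above can be sketched as follows. The three input dictionaries (transcript ID to FPKM, to longest ORF length in amino acids, and to best blastx E-value) are hypothetical; they would have to be parsed from whatever the abundance estimation, ORF extraction, and BLAST steps actually produce. The default thresholds are the ones quoted in the list.

```python
def select_best(fpkm, orf_aa, blastx_evalue,
                min_fpkm=1.0, min_orf_aa=200, max_evalue=1e-10):
    """Return transcript IDs passing at least one of the three filters."""
    expressed = {t for t, value in fpkm.items() if value >= min_fpkm}
    coding = {t for t, length in orf_aa.items() if length >= min_orf_aa}
    homologous = {t for t, evalue in blastx_evalue.items() if evalue <= max_evalue}
    return expressed | coding | homologous


# Toy inputs: t1 passes on expression, t2 on ORF length, t3 on homology, t4 on nothing.
fpkm = {"t1": 5.2, "t2": 0.3, "t3": 0.1, "t4": 0.0}
orf_aa = {"t1": 50, "t2": 310, "t4": 20}
blastx_evalue = {"t3": 1e-40, "t4": 0.5}
print(sorted(select_best(fpkm, orf_aa, blastx_evalue)))
```

Because this is a union, a transcript only needs to pass one filter; swapping `|` for `&` would give the much stricter intersection.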

Thanks again for your help, guys.

I will update you once I finish the analysis.

modified 7.7 years ago • written 7.7 years ago by upendrakumar.devisetty (370)
1
arnstrm (1.8k), Ames, IA, wrote 6.9 years ago:

Also, I would like to point out the article "Optimizing de novo assembly of short-read RNA-seq data for phylogenomics". Although it is written for phylogenomics, the method can be applied to other kinds of studies.
