Question: Annotating sequences after de-novo Trinity assembly and RSEM analysis...there must be an easier way!
samantha_jeschonek (50), United States, wrote 3.5 years ago:

Hello, I'm hoping someone can provide some insight or point me in the right direction... I have very little programming knowledge and am fairly new to RNA-seq, but I'm sure there must be an easier way to do what I need...

Using Trinity de novo assembly, I have assembled the paired-end reads from my RNA-seq data. I have also used the Trinity RSEM utility to calculate transcript abundance. I would now like to annotate, or identify by protein name, the transcripts most highly expressed in certain samples.

Currently, I import the RSEM output file (RSEM.genes.results), with FPKM values, into Excel as a tab-delimited file and sort by highest FPKM. Then I search the Trinity assembly output (.fasta) for the gene ID corresponding to each FPKM value. There I can find the corresponding sequence, which I then manually paste into the nucleotide BLAST search on the NCBI website... for each individual gene.

This is a very cumbersome and tedious approach and I am certain there is a more automated way to do this. I have very limited programming experience so I cannot quickly write a script to do the above for me... but I'm almost positive there must be some built-in Trinity function or other already established script that can do this. What is the approach that is generally taken? I would be extremely grateful if you could point me in the right direction! Thank you for any help!

blast rna-seq • 11k views
Modified 2.8 years ago by geek_y (8.2k) • written 3.5 years ago by samantha_jeschonek (50)

You're correct that this is a good candidate process for automation. I don't personally know of any existing tool that does exactly what you're wanting, but I'd be happy to write a quick script; it would be a good exercise for me.

If you can post Dropbox links to an example of your RSEM output file, and maybe an Excel file in progress, that would give me a full understanding of what you're trying to do.
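The manual steps described in the question (sort by FPKM, look up the IDs in the assembly, collect the sequences) can be sketched in a few lines of Python. This is only a sketch under assumptions: that the RSEM file has the standard `gene_id`, `transcript_id(s)` and `FPKM` columns, and that the first word of each FASTA header in the Trinity assembly is the transcript ID. The function names and the output file name used below are made up for illustration.

```python
import csv


def top_genes(rsem_path, n=50):
    """Return the n rows of an RSEM genes.results file with the highest FPKM."""
    with open(rsem_path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda row: float(row["FPKM"]), reverse=True)
    return rows[:n]


def read_fasta(fasta_path):
    """Yield (header, sequence) pairs from a FASTA file, joining wrapped lines."""
    header, chunks = None, []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_top_transcripts(rsem_path, fasta_path, out_path, n=50):
    """Write all transcripts of the n most highly expressed genes to out_path.

    Assumes the first whitespace-separated word of each FASTA header is the
    transcript ID listed in the 'transcript_id(s)' column of the RSEM file.
    Returns the number of sequences written.
    """
    fpkm_by_transcript = {}
    for row in top_genes(rsem_path, n):
        for transcript_id in row["transcript_id(s)"].split(","):
            fpkm_by_transcript[transcript_id] = row["FPKM"]

    written = 0
    with open(out_path, "w") as out:
        for header, sequence in read_fasta(fasta_path):
            words = header.split()
            if words and words[0] in fpkm_by_transcript:
                fpkm = fpkm_by_transcript[words[0]]
                out.write(f">{words[0]} gene_FPKM={fpkm}\n{sequence}\n")
                written += 1
    return written
```

A call such as `write_top_transcripts("RSEM.genes.results", "Trinity.fasta", "top50.fasta", n=50)` would produce a single FASTA file that can be submitted to BLAST in one batch instead of one sequence at a time.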

Reply written 3.5 years ago by Dan D (6.3k)

My group has recently developed a pipeline for transcriptome annotation (Annocript). The pipeline identifies both coding and non-coding RNAs and, after preliminary configuration (parameters, database download, additional software installation), is completely automated for all future runs. It is comparatively faster than current annotation pipelines and gives protein, domain, GO term, Enzyme and Pathway annotation. Further, it estimates the ORF size and non-coding potential of each transcript to assign a binary classification of the transcript as coding or non-coding.




Mailing list:!forum/annocript

Reply modified 3.0 years ago • written 3.0 years ago by projectbasu (10)
David Fredman (910), University of Bergen, Norway, wrote 3.5 years ago:

The following tools will automate the process to annotate a de novo assembled transcriptome for a species where an annotated reference set does not exist:

First, see the information regarding downstream analysis on the Trinity website.

Trinotate describes the annotation process in some detail and provides tools to collate all of the generated information.

After party is another option that allows you to annotate (again with BLAST and InterPro) and set up a searchable web resource for your transcriptome.

Agalma is a pipeline that'll help you integrate transcriptomes into phylogenetic analyses.

Blast2GO - a commercial stand-alone Java application that will BLAST all your sequences, identify InterPro domains, assign GO terms, and name your genes. You can also get the 10 (configurable) nearest BLAST hits etc. for each sequence.

Note that the (computational) time necessary to create these annotation datasets transcriptome-wide is quite long (several days), especially if you are using public services for BLAST and InterPro. To speed up the process, one can, for example, import BLAST results generated on a local cluster or the like, but this obviously requires more effort.
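As a rough illustration of working with locally generated BLAST results, the sketch below reduces a tabular BLAST output file to the single best hit per transcript. It assumes the results were written in the default 12-column tabular format (`-outfmt 6`) and that "best" means highest bit score; the function name `best_hits` and the file name in the usage note are hypothetical.

```python
import csv

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(blast_path):
    """Return {query ID: hit dict} keeping the highest-bitscore hit per query.

    Assumption: the file is tab-separated with the 12 default columns.
    Comment lines (starting with '#') and short lines are skipped.
    """
    best = {}
    with open(blast_path, newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            if len(fields) < len(BLAST_COLUMNS) or fields[0].startswith("#"):
                continue
            hit = dict(zip(BLAST_COLUMNS, fields))
            current = best.get(hit["qseqid"])
            if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
                best[hit["qseqid"]] = hit
    return best
```

For example, `for query, hit in best_hits("blast_results.tsv").items(): print(query, hit["sseqid"], hit["evalue"])` would list each transcript next to its top-scoring database match, which can then be joined to the expression table by transcript ID.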

Modified 3.5 years ago • written 3.5 years ago by David Fredman (910)

Only thing to add: Blast2GO is terrible for large #'s of sequences (>30-40k), such as the #'s typically generated from Trinity. Not to mention it's now gone primarily commercial, even the command line version (it's quite expensive for its use case).

Reply written 3.5 years ago by Chris Fields (1.9k)

Thanks for pointing that out Chris - both good points. I've used Blast2GO in the past with ~30,000 transcripts and it was painful enough ;) I didn't know they are now commercial! Will add a comment on that in the initial recommendation.

Reply written 3.5 years ago by David Fredman (910)

geek_y (8.2k) wrote 2.8 years ago:

Annocript also does a decent job.
