Question

Annotating sequences after de-novo Trinity assembly and RSEM analysis...there must be an easier way!

4

9.7 years ago

samantha_jeschonek ▴ 50

Hello, I'm hoping someone can provide some insight or point me in the right direction... I have very little programming knowledge and am fairly new to RNA-seq but I'm sure there must be an easier way to do what I need...

Using Trinity de novo assembly, I have assembled the paired-end reads from my RNA-seq data. I have also used the Trinity RSEM utility to calculate transcript abundance. I would now like to annotate or identify, by protein name, the transcripts most highly expressed in certain samples.

Currently, I import the RSEM output file (RSEM.genes.results), with FPKM values, into an Excel / tab-delimited file and sort by highest FPKM. Then I search the Trinity assembly output (.fasta) for the gene ID corresponding to each FPKM value. There I can find the corresponding sequence, which I then manually paste into nucleotide BLAST on the NCBI website...for each individual gene.

This is a very cumbersome and tedious approach, and I am certain there is a more automated way to do this. I have very limited programming experience, so I cannot quickly write a script to do the above for me...but I'm almost positive there must be some built-in Trinity function or other established script that can do this. What approach is generally taken? I would be extremely grateful if you could point me in the right direction! Thank you for any help!
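
The manual lookup described above (sort by FPKM, then find each sequence in the assembly) could be scripted roughly as follows. This is only a minimal sketch, not a Trinity or RSEM utility: it assumes the RSEM gene-level table is tab-delimited with a header containing `gene_id`, `transcript_id(s)` and `FPKM` columns, and that the first word of each FASTA header in the assembly matches those transcript IDs. The function names (`top_genes`, `read_fasta`, `extract_top_sequences`, `to_fasta`) and the demo data are made up for illustration.

```python
import csv
import io


def top_genes(rsem_handle, n=10, column="FPKM"):
    """Return the n rows of the RSEM table with the highest value in `column`."""
    rows = list(csv.DictReader(rsem_handle, delimiter="\t"))
    rows.sort(key=lambda row: float(row[column]), reverse=True)
    return rows[:n]


def read_fasta(fasta_handle):
    """Yield (identifier, sequence); the identifier is the header's first word."""
    name, chunks = None, []
    for line in fasta_handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            words = line[1:].split()
            name, chunks = (words[0] if words else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def extract_top_sequences(rsem_handle, fasta_handle, n=10):
    """Return (gene_id, transcript_id, FPKM, sequence) for the top-n genes."""
    wanted = {}
    for row in top_genes(rsem_handle, n):
        # Assumed column: comma-separated transcripts belonging to the gene.
        for transcript_id in row["transcript_id(s)"].split(","):
            wanted[transcript_id] = (row["gene_id"], float(row["FPKM"]))
    records = []
    for name, seq in read_fasta(fasta_handle):
        if name in wanted:
            gene_id, fpkm = wanted[name]
            records.append((gene_id, name, fpkm, seq))
    records.sort(key=lambda rec: rec[2], reverse=True)  # highest FPKM first
    return records


def to_fasta(records):
    """Format the extracted records as FASTA text, ready for a batch BLAST."""
    return "".join(
        f">{tid} gene={gene} FPKM={fpkm}\n{seq}\n"
        for gene, tid, fpkm, seq in records
    )


if __name__ == "__main__":
    # Tiny in-memory stand-ins for RSEM.genes.results and the Trinity FASTA.
    rsem = io.StringIO(
        "gene_id\ttranscript_id(s)\tlength\teffective_length\t"
        "expected_count\tTPM\tFPKM\n"
        "g1\tg1_i1\t100\t90\t5\t1.0\t2.5\n"
        "g2\tg2_i1,g2_i2\t200\t190\t50\t10.0\t30.0\n"
    )
    fasta = io.StringIO(">g1_i1 len=4\nACGT\n>g2_i1 len=8\nAAAA\nCCCC\n")
    print(to_fasta(extract_top_sequences(rsem, fasta, n=1)), end="")
```

With real data, the two `io.StringIO` objects would be replaced by opened files (for example `open("RSEM.genes.results", newline="")` and the assembly FASTA), and the resulting FASTA text could be saved and searched in a single batch rather than one sequence at a time.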

blast RNA-Seq • 17k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.7 years ago by samantha_jeschonek ▴ 50

1

You're correct that this is a good candidate process for automation. I don't personally know of any existing tool that does exactly what you're wanting, but I'd be happy to write a quick script; it would be a good exercise for me.

If you can post Dropbox links to an example of your RSEM output file, and maybe an Excel file in progress, that would give me a full understanding of what you're trying to do.

ADD REPLY • link 9.7 years ago by Dan D 7.4k

1

My group has recently developed a pipeline for transcriptome annotation (Annocript). The pipeline identifies both coding and non-coding RNAs and, after preliminary configuration (parameters, database download, additional software installation), is completely automated for all future runs. It is faster than current annotation pipelines and gives protein, domain, GO term, Enzyme and Pathway annotation. Further, it estimates the ORF size and non-coding potential of each transcript to classify the transcript as either coding or non-coding.

Pipeline: https://github.com/frankMusacchia/Annocript
Publication: http://www.ncbi.nlm.nih.gov/pubmed/25701574
Mailing list: https://groups.google.com/forum/#!forum/annocript

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.2 years ago by projectbasu ▴ 10

Ram · Answer 1 · 2014-08-03

12

9.7 years ago

David Fredman ★ 1.1k

The following tools will automate the process to annotate a de novo assembled transcriptome for a species where an annotated reference set does not exist:

First, see the information regarding downstream analysis on the Trinity website

Trinotate describes the annotation process in some detail and provides tools to collate all generated information

After party is another option that allows you to annotate (again with Blast and Interpro) and set up a searchable web resource for your transcriptome.

Agalma is a pipeline that'll help you integrate transcriptomes into phylogenetic analyses

Blast2GO - a commercial stand-alone Java application that will Blast all your sequences, find Interpro domains, assign GO terms, and name your genes. You can also get the 10 (configurable) nearest Blast hits etc. for each sequence.

Note that the (computational) time necessary to create these annotation datasets transcriptome-wide is quite long (several days), especially if you are using public services for Blast and Interpro. To speed up the process, one can e.g. import Blast results generated on a local cluster or the like, but this obviously requires more effort.
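
As a rough illustration of working with Blast results generated locally, the sketch below keeps the best hit per transcript from a tabular results file. It assumes the search was run with BLAST+ using `-outfmt 6` and its default 12 columns; the function name `best_hits`, the E-value cutoff of 1e-5 and the demo rows are illustrative choices, not something any of the tools above prescribe.

```python
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(blast_handle, max_evalue=1e-5):
    """Return {query_id: hit} keeping the highest-bitscore hit per query.

    Hits with an E-value above `max_evalue` (an assumed cutoff) are ignored.
    """
    best = {}
    for line in blast_handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST6_COLUMNS, line.split("\t")))
        if float(hit["evalue"]) > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best


if __name__ == "__main__":
    # Made-up rows standing in for a real results file.
    demo = io.StringIO(
        "t1\tsp|P1|A\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t150\n"
        "t1\tsp|P2|B\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t200\n"
        "t2\tsp|P3|C\t50.0\t30\t15\t0\t1\t30\t1\t30\t0.5\t20\n"
    )
    for query, hit in best_hits(demo).items():
        print(query, hit["sseqid"], hit["evalue"], sep="\t")
```

The resulting table (transcript ID, best-matching subject, E-value) can then be joined to the expression values by transcript ID.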

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.7 years ago by David Fredman ★ 1.1k

2

Only thing to add: BLAST2GO is terrible for large numbers of sequences (>30-40k), such as the numbers typically generated by Trinity. Not to mention it has now gone primarily commercial, even the command-line version (it's quite expensive for its use case).

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.7 years ago by Chris Fields ★ 2.2k

0

Thanks for pointing that out Chris - both good points. I've used Blast2GO in the past with ~30,000 transcripts and it was painful enough ;) I didn't know they are now commercial! Will add a comment on that in the initial recommendation.

ADD REPLY • link 9.7 years ago by David Fredman ★ 1.1k

Ram · Answer 2 · 2015-04-26

3

9.0 years ago

GouthamAtla 12k

Annocript also does a decent job.

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.0 years ago by GouthamAtla 12k