Question: Plant Annotation Workflow
1
5.3 years ago by
User000270
User000270 wrote:

Hello, I have ~200,000 de novo assembled plant transcripts and need to do a proteome annotation. I ran BLASTX with my transcripts against protein databases, but I got very poor hits. My final goal is to functionally annotate my transcripts. Any suggestions? A detailed explanation would be much appreciated, since I have never worked in this field. Any other suggestions on this topic are also welcome.

bioinformatics • 3.2k views
modified 5.3 years ago by rtliu2.0k • written 5.3 years ago by User000270
2
5.3 years ago by
Mary11k
Boston MA area
Mary11k wrote:

Have you looked around at the iPlant resources? They may have some useful guidance for you: http://www.iplantcollaborative.org/

written 5.3 years ago by Mary11k
1
5.3 years ago by
jackuser1979860
US
jackuser1979860 wrote:

Try running blastx with an adjusted E-value threshold (maybe 1e-10) to get hits. If you still find no hits after that, you can move on to domain annotation (e.g., Pfam), then Gene Ontology annotation, and then KEGG annotation. You can use the Blast2GO tool, which does all of the above annotations.

written 5.3 years ago by jackuser1979860
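
As a rough illustration of the E-value filtering step suggested above, the sketch below keeps the best-scoring hit per transcript from BLAST tabular output. It assumes the default 12-column `-outfmt 6` layout; the function name `best_hits`, the 1e-10 cutoff, and the sample rows are illustrative assumptions, not something taken from the thread.

```python
import io

# Column order of BLAST+ tabular output (-outfmt 6, default fields).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-10):
    """Return {query_id: hit} with the highest-bitscore hit per query,
    considering only hits whose E-value is at or below max_evalue."""
    best = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < len(COLUMNS):
            continue  # skip malformed rows
        hit = dict(zip(COLUMNS, fields))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] > max_evalue:
            continue  # hit is weaker than the chosen cutoff
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up rows with hypothetical IDs, tab-separated like real output.
example = io.StringIO(
    "contig_1\tprotA\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t350\n"
    "contig_1\tprotB\t60.0\t150\t60\t0\t1\t450\t1\t150\t1e-12\t120\n"
    "contig_2\tprotC\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.5\t30\n"
)
hits = best_hits(example)
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["evalue"], hit["bitscore"])
```

Transcripts that drop out of this table (here `contig_2`) would be the candidates for the follow-up domain-level annotation mentioned in the post.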
1
5.3 years ago by
SES8.2k
Vancouver, BC
SES8.2k wrote:

If you are working with grass genomes then you are in luck, because there are several well-annotated grass genomes (e.g., rice, maize, sorghum). A good place to start would be the Gramene website, which is a resource for working with plant genomes. It is not possible to fully explain what path you should take without knowing your end goal. For example, it's not clear if you are just trying to annotate a genome or if there is some underlying biological question that you are actually interested in. Whatever your goal is, it may be helpful to know that there is an archive of data on the Gramene site that allows you to bulk download genes, ontologies, pathway information, etc. That may give you faster access to the data you need, rather than trying to construct these resources yourself.

written 5.3 years ago by SES8.2k
1
5.3 years ago by
Ann2.2k
Concord NC USA
Ann2.2k wrote:

It seems weird that you are not getting very good hits. What's the size distribution of your transcripts? Maybe they are mostly very short? Also, many of your transcripts might be non-coding. I would expect about half the sequences to be non-coding based on my experience with blueberry and working with a draft genome, which is of course a very different project from yours. Others may have a much better idea - I've only done this type of thing for one plant.

Regarding which plant databases to use: I would recommend getting the RefSeq protein databases for all the fully sequenced, annotated plants and maybe supplementing those with proteomes from Phytozome. However, there's a catch - some plant genomes have many more functional annotations than others. Arabidopsis is probably the most extensively annotated, followed by rice. Tomato also seems pretty well annotated. Another database you definitely want to use for annotation is the PlantCyc enzyme database. Once you sort out the informatics, you can use it to assign plant pathway accessions, which can be incredibly useful if you're going after medicinal compounds or other metabolic pathways.

written 5.3 years ago by Ann2.2k
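
To answer the size-distribution question in practice, a minimal sketch like the following could summarize transcript lengths from a FASTA file. The helper names, the 300 bp "short" cutoff, and the toy sequences are assumptions made for illustration; a real run would pass an open file handle instead of the in-memory example.

```python
import io
import statistics


def read_fasta(handle):
    """Yield (sequence_id, sequence) pairs from FASTA-formatted text.
    Only the first word of each header line is kept as the ID."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def length_summary(lengths, short_cutoff=300):
    """Summarize sequence lengths; short_cutoff (bp) is an arbitrary choice."""
    if not lengths:
        raise ValueError("no sequences found")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running, n50 = 0, 0
    for length in ordered:
        running += length
        if 2 * running >= total:  # half of all bases are now covered
            n50 = length
            break
    return {
        "count": len(ordered),
        "min": ordered[-1],
        "max": ordered[0],
        "median": statistics.median(ordered),
        "n50": n50,
        "fraction_short": sum(1 for n in ordered if n < short_cutoff) / len(ordered),
    }


# Toy input with four made-up transcripts of 100, 200, 300 and 400 bp.
toy = io.StringIO(
    ">t1\n" + "A" * 100 + "\n>t2\n" + "C" * 200 + "\n"
    ">t3\n" + "G" * 300 + "\n>t4\n" + "T" * 400 + "\n"
)
summary = length_summary([len(seq) for _, seq in read_fasta(toy)])
print(summary)
```

A large `fraction_short` would support the suspicion that many contigs are too short to give convincing protein hits.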

Yes, they are short. My wheat is also tetraploid, so there is the problem of homeologs. I also ran BLAST against Phytozome and got hits for less than half of the contigs. In the end, the only database that gave me more hits was Ensembl. Why would half of the sequences be non-coding? May I ask which workflow you used to annotate blueberry? Thank you for the post.

modified 5.3 years ago • written 5.3 years ago by User000270
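
On the question of non-coding transcripts, one crude way to get a first impression is to measure the longest stretch without a stop codon in any of the six reading frames. The sketch below does only that; it is a heuristic under stated assumptions (standard stop codons, no start-codon requirement, hypothetical function names), not a coding-potential predictor.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string (other symbols pass through)."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf_codons(seq):
    """Length, in codons, of the longest stop-free stretch over all six frames."""
    seq = seq.upper()
    best = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            run = 0
            # Walk the strand codon by codon in this frame.
            for i in range(frame, len(strand) - 2, 3):
                if strand[i:i + 3] in STOP_CODONS:
                    run = 0  # a stop codon ends the current stretch
                else:
                    run += 1
                    best = max(best, run)
    return best


# Made-up sequences: one stop-free repeat and one interrupted by stop codons.
examples = {
    "open_like": "GCT" * 50,
    "interrupted": "GCTTAAGCTTAGGCTTGA" * 5,
}
for name, seq in examples.items():
    print(name, len(seq), longest_orf_codons(seq))
```

Transcripts whose longest stretch is only a few dozen codons are less likely to encode a full protein, although short contigs from fragmented assemblies will also score low for purely technical reasons.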
0
5.3 years ago by
nbvasani230
United States
nbvasani230 wrote:

Download the Arabidopsis thaliana protein database from NCBI or UniProt, then run blastx with your transcripts against it. As Arabidopsis thaliana resemble to many of plant species, you will get lots of hits.

written 5.3 years ago by nbvasani230
1

As Arabidopsis thaliana resemble to many of plant species, you will get lots of hits.

That is not a reasonable statement to make. Arabidopsis thaliana is a species that has a per base pair substitution rate several times higher than the average angiosperm. So, it is not a good choice for finding distant homologies.

modified 5.3 years ago • written 5.3 years ago by SES8.2k

I agree with "Arabidopsis thaliana is a species that has a per base pair substitution rate several times higher than the average angiosperm." As per my understanding, best route to start annotation is to start with plant species which is well studied and resemble to your plant species.

written 5.3 years ago by nbvasani230
1

Okay, you agree but then you repeat the same thing by saying, "start with plant species which...resemble your plant species." My point was that we don't know what species is being annotated and using Arabidopsis alone is not the best choice.

written 5.3 years ago by SES8.2k
1

You are right, Arabidopsis alone is not the best choice. But he has to start from some plant database in order to annotate his assembly. If you have any suggestions, let User000 know instead of making unnecessary statements.

written 5.3 years ago by nbvasani230

There's no need to be argumentative. My statement is very relevant. By discussing better ways of doing things, we are helping the OP find a solution. We should focus on that point and not take comments about genome annotation personally.

modified 5.3 years ago • written 5.3 years ago by SES8.2k

In fact, nbvasani said plant specieS, meaning several plants, I guess. Anyway, as I mentioned above, I am using 9 plant species. I am also going to create a database of contaminants, which will include plant pathogens, human, and mouse sequences. Other ideas are welcome.

written 5.3 years ago by User000270

I have created a plant protein database that includes Arabidopsis thaliana, rice, barley, and others (9 species in total); however, I still get very poor hits. OK, let's assume I want to annotate at least those transcripts that have >90% identity. What would you suggest I do next?

modified 5.3 years ago • written 5.3 years ago by User000270
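
The ">90% identity" subset mentioned above can be pulled out of tabular BLAST results with a few lines of code. This is a minimal sketch assuming `-outfmt 6` rows, where the third column is percent identity and the fourth is alignment length. The minimum alignment length of 50 is an added assumption (to avoid counting very short, high-identity matches), and all IDs and the transcript total are made up.

```python
def high_identity_queries(lines, min_identity=90.0, min_aln_length=50):
    """Return the set of query IDs with at least one hit whose percent
    identity is above min_identity and whose alignment is long enough."""
    keep = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        query = fields[0]
        pident = float(fields[2])
        aln_length = int(fields[3])
        if pident > min_identity and aln_length >= min_aln_length:
            keep.add(query)
    return keep


# Hypothetical rows: one good hit, one too short, one with low identity.
rows = [
    "contig_1\tprotA\t95.5\t120\t5\t0\t1\t360\t1\t120\t1e-60\t240",
    "contig_2\tprotB\t92.0\t20\t2\t0\t1\t60\t1\t20\t1e-3\t40",
    "contig_3\tprotC\t70.0\t200\t60\t0\t1\t600\t1\t200\t1e-40\t180",
]
total_transcripts = 4  # assumed size of the toy assembly
annotated = high_identity_queries(rows)
fraction_annotated = len(annotated) / total_transcripts
print(sorted(annotated), fraction_annotated)
```

The resulting ID set can then be carried forward to whatever functional annotation step comes next, while the remaining transcripts are examined separately.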

You can try running blastx against the nr database from NCBI. Are you interested in differentially expressed transcripts?

modified 5.3 years ago • written 5.3 years ago by nbvasani230

I tried to run blastx against TrEMBL, but BLAST is extremely slow, so I could not go on. See my related post, C: Speeding up the BLAST job

written 5.3 years ago by User000270
1

Yep, it generally takes a week or so. Instead of concentrating on the whole de novo assembly, try generating a DGE list, then run blastx against all the databases you have. A DGE list will be easier to handle than the full de novo assembly, and you will get your results faster. If you still find few hits, you can try to BLAST your transcript sequences manually, one by one, with blastn at NCBI, since the number of transcripts in the DGE list will be far smaller than in the de novo list.

written 5.3 years ago by nbvasani230
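
The idea of running BLAST only on a differentially expressed subset comes down to filtering the assembly FASTA by a list of transcript IDs. A minimal sketch follows, assuming the IDs have already been exported from the differential expression analysis; the function names and toy records are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (sequence_id, sequence) pairs from FASTA-formatted text.
    Only the first word of each header line is kept as the ID."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def subset_fasta(fasta_handle, wanted_ids):
    """Yield only the records whose ID appears in wanted_ids."""
    wanted = set(wanted_ids)
    for name, seq in read_fasta(fasta_handle):
        if name in wanted:
            yield name, seq


# Toy assembly and a made-up list of differentially expressed transcript IDs.
assembly = io.StringIO(
    ">contig_1 len=8\nACGTACGT\n"
    ">contig_2 len=8\nTTGACCAA\n"
    ">contig_3 len=8\nGGGCCCAA\n"
)
de_ids = ["contig_2", "contig_9"]  # contig_9 is absent from the assembly
selected = list(subset_fasta(assembly, de_ids))
for name, seq in selected:
    print(f">{name}\n{seq}")
```

Writing the selected records to a new FASTA file gives a much smaller query set for blastx than the full assembly.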

How do I generate a DGE list? If it was a week, it is going to take me 2 months.

written 5.3 years ago by User000270

You can generate a differential gene expression (DGE) list using R packages such as edgeR and DESeq.

written 5.3 years ago by nbvasani230

I don't see the point of blasting something manually if I can download the nr database from NCBI. And will it really be faster with a DGE list?

written 5.3 years ago by User000270

It all depends on what you want from your data. With a DGE list it will be faster.

written 5.3 years ago by nbvasani230

Hi User000,

Have any NGS articles related to your plant species been published? You might get some clues for your annotation from them. If it's OK with you, can you tell me your plant species?

written 5.3 years ago by nbvasani230

I am working with Triticum durum, which is tetraploid. The contigs were de novo assembled earlier using CLC (I did not do that part). There are 2-3 articles related to Triticum; however, I need more, and perhaps more detailed, information.

written 5.3 years ago by User000270

Great! Contact the authors of those articles; they can tell you how they annotated their assembly.

written 5.3 years ago by nbvasani230
0
5.3 years ago by
rtliu2.0k
New Zealand
rtliu2.0k wrote:

MAKER-P has recently been used to annotate the loblolly pine genome - link

Quoted from the MAKER-P overview:

"Sequencing diverse plant species of evolutionary, agricultural, and medicinal interest is becoming routine for even small groups - genome annotation and analysis is much less so. The MAKER-P pipeline is designed to make the annotation of novel plant genomes tractable for small groups with limited bioinformatics experience and resources, and faster and more transparent for large groups with more experience and resources. The MAKER-P pipeline generates species-specific repeat libraries, as well as structural annotations of protein coding genes, non-coding RNAs, and pseudogen"

written 5.3 years ago by rtliu2.0k