Hello, I have ~200,000 de novo assembled plant transcripts and need to do a proteome annotation. I ran BLASTX with my transcripts against protein databases, but I got very poor hits. My final goal is to functionally annotate my transcripts. Any suggestions? A detailed explanation would be much appreciated, since I have never worked in this field. Any other suggestions on this topic are also welcome.
Try rerunning BLASTX with an adjusted E-value cutoff (maybe 1e-10; keep in mind that a larger cutoff is more permissive and returns more hits). If some transcripts still have no hits after that, you can go for domain annotation with something like Pfam, then Gene Ontology annotation, and then KEGG annotation. You can use the Blast2GO tool, which does all of the above annotations.
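To make the "filter by E-value, then send the leftovers to domain annotation" idea concrete, here is a minimal Python sketch. It assumes you saved your BLASTX results in the standard 12-column tabular format (BLAST+ `-outfmt 6`), where column 11 is the E-value and column 12 is the bit score. The function names `best_hits` and `unannotated` are made up for illustration and are not part of BLAST or Blast2GO.

```python
import csv
from io import StringIO


def best_hits(tabular_text, max_evalue=1e-10):
    """Keep the highest-bit-score hit per query among hits passing the cutoff.

    Assumes the 12 standard columns of BLAST tabular output:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore
    """
    best = {}
    for row in csv.reader(StringIO(tabular_text), delimiter="\t"):
        # Skip blank lines and comment lines (present in -outfmt 7 style files).
        if not row or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit is weaker than the chosen cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


def unannotated(all_transcript_ids, hits):
    """Transcripts with no passing hit: candidates for Pfam/GO/KEGG steps."""
    return [t for t in all_transcript_ids if t not in hits]
```

Running `best_hits` with a few different `max_evalue` values on the same file is a cheap way to see how sensitive your hit count is to the cutoff before rerunning BLASTX itself.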
If you are working with grass genomes, then you are in luck, because there are several well-annotated grass genomes (e.g., rice, maize, sorghum). A good place to start would be the Gramene website, which is a resource for working with plant genomes. It is not possible to fully explain what path you should take without knowing your end goal. For example, it's not clear whether you are just trying to produce an annotation or whether there is some underlying biological question that you are actually interested in. Whatever your goal is, it may be helpful to know that there is an archive of data on the Gramene site that allows you to bulk download genes, ontologies, pathway information, etc. That may give you faster access to the data you need than trying to construct these resources yourself.
It seems weird that you are not getting very good hits. What's the size distribution of your transcripts? Maybe they are mostly very short? Also, many of your transcripts might be non-coding. I would expect about half the sequences to be non-coding, based on my experience with blueberry and working with a draft genome, which is of course a very different project from yours. Others may have a much better idea - I've only done this type of thing for one plant. Regarding which plant databases to use: I would recommend getting all the RefSeq protein databases for fully sequenced, annotated plants and maybe supplementing those with proteomes from Phytozome. However, there's a catch - some plant genomes have many more functional annotations than others. Arabidopsis is probably the most extensively annotated, followed by rice. Tomato also seems pretty well annotated. Another database you definitely want to use for annotation is the PlantCyc enzyme database. Once you sort out the informatics, you can use it to assign plant pathway accessions, which can be incredibly useful if you're going after medicinal compounds or other metabolic pathways.
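For the size distribution question, a quick check along these lines might help. This is only a sketch using the Python standard library; it assumes your assembly is a plain FASTA file, and the function names (`read_fasta_lengths`, `n50`, `summarize`) and the 300 nt "short" cutoff are arbitrary choices for illustration, not an established threshold.

```python
def read_fasta_lengths(lines):
    """Return {sequence_id: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(values):
    """Length L such that sequences of length >= L hold at least half the bases."""
    ordered = sorted(values, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def summarize(lengths, short_cutoff=300):
    """Basic length statistics, including the fraction of short transcripts."""
    values = list(lengths.values())
    if not values:
        return {"count": 0, "min": 0, "max": 0, "n50": 0, "fraction_short": 0.0}
    n_short = sum(1 for v in values if v < short_cutoff)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "n50": n50(values),
        "fraction_short": n_short / len(values),
    }
```

You could call it as `summarize(read_fasta_lengths(open("transcripts.fasta")))`, where `transcripts.fasta` is a placeholder for your own file. If a large fraction of the transcripts turn out to be very short, that alone could explain a lot of the poor BLASTX hits.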
Quoted from the MAKER-P overview:
"Sequencing diverse plant species of evolutionary, agricultural, and medicinal interest is becoming routine for even small groups - genome annotation and analysis is much less so. The MAKER-P pipeline is designed to make the annotation of novel plant genomes tractable for small groups with limited bioinformatics experience and resources, and faster and more transparent for large groups with more experience and resources. The MAKER-P pipeline generates species-specific repeat libraries, as well as structural annotations of protein coding genes, non-coding RNAs, and pseudogen"