Question: Strategy for genome annotation for a new individual of a well-established species
9 months ago by
liorglic40 wrote:

I am new to the topic of genome annotation and would like to get some advice regarding my planned strategy.
So, I am exploring genetic diversity in maize, and for this purpose I have de novo assembled the genome of a variant which is supposed to be quite divergent from the reference. I'd now like to annotate the assembly. The strategy I have in mind is as follows:
Since maize has quite a lot of genomic resources, I don't see a reason to go for ab initio annotation as a first attempt. Rather, I would like to base my annotation on the existing annotation of the maize reference (B73) and a collection of transcripts which I have gathered from previous publications. Unfortunately, I do not have RNA-Seq data from the individual I am annotating. I combined multiple transcript sets and the official annotation and aligned them to my assembled sequence. However, the result seems rather noisy, with ~150k predicted genes, which sounds like too many. I am wondering what my next step should be. Maybe I should filter my transcript set to reduce noise? I have already tried filtering out very short transcripts, but can I do something more sophisticated? Another option I have thought of is putting the transcripts through some clustering algorithm (e.g. OrthoMCL), then taking representative transcripts from each cluster and maybe removing singletons.
Is there a common way to assess the integrity of a specific gene annotation? Maybe this could help me remove pseudogenes and/or random alignments from my annotation results?
Has anyone here done something like this before? I would appreciate any thoughts or advice on the strategy I described here.
Thank you!

modified 9 months ago by Rohit1.3k • written 9 months ago by liorglic40

You can take a look at RATT which is now part of PAGIT.

written 9 months ago by genomax59k

Thanks, I wasn't familiar with this software. However, since I'm also interested in detecting new genes not present in the reference annotation, I'd like to use the transcriptomic data on top of that, and this is where most of the noise comes from.

written 9 months ago by liorglic40
9 months ago by
Rohit1.3k wrote:

PASA has a pipeline where you can use existing information to improve the annotations. GeMoMa also provides a workflow that uses annotations from multiple references (not entirely transcript-based) to annotate a target genome. The ~150k predicted genes are probably because isoforms are reported separately, in which case you would have to collapse them into shared gene boundaries. Also, many of the predicted annotations may be pseudogenes or lncRNAs, but separating these from genes is another problem altogether :|
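To get a rough idea of how much of the ~150k count is due to isoforms, overlapping transcript alignments on the same sequence and strand can be merged into loci. The snippet below is only a minimal sketch of that idea and is not what PASA or GeMoMa do internally. The function name `collapse_to_loci`, the tuple layout and the demo coordinates are all made up for illustration, and it assumes you have already extracted one (seqid, strand, start, end, transcript_id) record per aligned transcript from your alignment output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (seqid, strand, start, end, transcript_id),
# with 1-based inclusive coordinates as in GFF3.
Alignment = Tuple[str, str, int, int, str]
Locus = Tuple[str, str, int, int, List[str]]


def collapse_to_loci(alignments: Iterable[Alignment]) -> List[Locus]:
    """Merge transcript spans that overlap on the same sequence and strand."""
    by_strand: Dict[Tuple[str, str], List[Tuple[int, int, str]]] = defaultdict(list)
    for seqid, strand, start, end, tid in alignments:
        by_strand[(seqid, strand)].append((start, end, tid))

    loci: List[Locus] = []
    for (seqid, strand), spans in sorted(by_strand.items()):
        spans.sort()  # sort by start, then end
        cur_start, cur_end, members = spans[0][0], spans[0][1], [spans[0][2]]
        for start, end, tid in spans[1:]:
            if start <= cur_end:
                # Overlaps the current locus: extend it and record the isoform.
                cur_end = max(cur_end, end)
                members.append(tid)
            else:
                # Gap found: close the current locus and start a new one.
                loci.append((seqid, strand, cur_start, cur_end, members))
                cur_start, cur_end, members = start, end, [tid]
        loci.append((seqid, strand, cur_start, cur_end, members))
    return loci


if __name__ == "__main__":
    # Made-up coordinates, purely for demonstration.
    demo = [
        ("chr1", "+", 100, 500, "tx1"),
        ("chr1", "+", 450, 900, "tx2"),
        ("chr1", "-", 480, 700, "tx3"),
        ("chr1", "+", 2000, 2500, "tx4"),
    ]
    for locus in collapse_to_loci(demo):
        print(locus)
```

Merging on the full transcript span is deliberately crude: it will fuse neighbouring genes whose UTRs overlap, or a gene nested in another gene's intron. Requiring exon-level overlap would be stricter, but the locus count from a sketch like this is usually enough to tell whether isoform redundancy explains the inflated number.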

written 9 months ago by Rohit1.3k