Question

Genome Annotation Quality Measure

11

13.5 years ago

Darked89 4.6k

I have the output of several gene prediction programs (using the term loosely):

de novo predictors (Augustus, GlimmerHMM, Geneid, SNAP, Genscan)
RNA-Seq mapped with Tophat and Cufflinks
EST sets mapped with PASA (same species)/GMAP (2M+ plant ESTs)
10 protein sets mapped with exonerate

I also have:

semi-curated set of 1000 proteins (= non chimeric, non truncated, with correct size and similarity to other plant proteins, but exon borders may be at times wrong/small introns retained), ca 700 of them unique at 50% protein similarity level (uclust)
400+ CEGMA predictions based on HMM profiles of conserved set of genes

So far, Augustus with RNA-Seq evidence support is way ahead at predicting sensible genes. I have been comparing the numbers of "exons" shared between these sets, and I am puzzled by the large number of exons unique to almost every method used. While this would be normal for de novo predictors, I was hoping that homology-based methods (i.e. exonerate protein to genome, GMAP and Cufflinks) would overlap far more. I am going to work on improving individual programs' results where possible (retraining, better filtering of ESTs/proteins, etc.).
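The exon-sharing comparison described above can be sketched as follows. This is only an illustration, assuming each method's exons have already been parsed into tuples of (seqid, start, end, strand) and that "shared" means identical coordinates; the function names `exon_set_report` and `pairwise_jaccard` are hypothetical. Exact-coordinate matching is strict: a boundary that is off by a single base makes an exon count as unique, which may be part of why so many exons look method-specific.

```python
from itertools import combinations


def exon_set_report(predictions):
    """Count total exons and exons found by no other method.

    predictions: dict mapping method name -> set of
    (seqid, start, end, strand) tuples (an assumed input format).
    """
    report = {}
    for name, exons in predictions.items():
        # Pool the exons of every other method.
        others = set().union(*(e for n, e in predictions.items() if n != name))
        report[name] = {"total": len(exons), "unique": len(exons - others)}
    return report


def pairwise_jaccard(predictions):
    """Jaccard index (shared / union) of exact exons for each pair of methods."""
    scores = {}
    for a, b in combinations(sorted(predictions), 2):
        union = predictions[a] | predictions[b]
        shared = predictions[a] & predictions[b]
        scores[(a, b)] = len(shared) / len(union) if union else 0.0
    return scores
```

Relaxing the match (for example, counting exons that overlap by some fraction of their length) would be a natural variation if the exact-match numbers look too pessimistic.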

I am looking for some genome-wide measure that tells me how well I am doing, whether for an individual gene prediction program or for a prediction combiner, compared with, say, Arabidopsis and two or three other recently annotated plant genomes. Any ideas?

genome gene • 6.2k views

ADD COMMENT • link updated 13.5 years ago by Daniel Standage 4.1k • written 13.5 years ago by Darked89 4.6k

score 5 · Answer 1 · 2010-11-17

13.5 years ago

biobot 0.0.77.a.1099 6.2k

Annotation Edit Distance (AED), devised by Eilbeck et al., might suit your needs or be a place to start. From the paper: "AED is similar to performance measures employed by the gene-prediction community, but takes into account aspects of annotations not well addressed by conventional sensitivity/specificity measures such as alternative splicing."
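For a concrete feel of the measure, here is a minimal sketch of a nucleotide-level AED between one reference annotation and one prediction. It assumes the definition AED = 1 - (SN + SP) / 2, where SN is the fraction of reference nucleotides recovered and SP is the fraction of predicted nucleotides that are in the reference (specificity in the gene-prediction sense). The function names and the 1-based inclusive (start, end) input format are assumptions made for this illustration, not code from the paper.

```python
def covered_positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def annotation_edit_distance(reference_exons, predicted_exons):
    """Nucleotide-level AED: 0.0 means identical, 1.0 means no agreement."""
    ref = covered_positions(reference_exons)
    pred = covered_positions(predicted_exons)
    if not ref or not pred:
        # Treat a missing annotation on either side as complete disagreement.
        return 1.0
    shared = len(ref & pred)
    sensitivity = shared / len(ref)
    specificity = shared / len(pred)
    return 1.0 - (sensitivity + specificity) / 2.0
```

Expanding intervals into position sets is simple but memory-hungry; for whole genomes an interval-based overlap calculation would likely be preferable.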

ADD COMMENT • link 13.5 years ago by biobot 0.0.77.a.1099 6.2k

Great link. I've never seen this paper. I'll need to read it in detail. Probably has some applicability to what I'm currently working on!

ADD REPLY • link 13.5 years ago by Daniel Standage 4.1k

...However, AED also looks at individual annotations rather than giving a global measure, which I think is what is being asked here.

ADD REPLY • link 13.5 years ago by Daniel Standage 4.1k

Well, a global measure is a matter of aggregating the individual measurements. The paper plots cumulative AED for some genome releases over time. Or one might restrict the calculation to a subset of particularly important features for that organism, YMMV.
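One way to aggregate per-annotation scores into a global picture is a cumulative distribution: for each threshold, the fraction of annotations whose AED is at or below it. The sketch below assumes the per-annotation AED values have already been computed; `cumulative_aed` is a hypothetical helper name, and the thresholds are arbitrary.

```python
def cumulative_aed(aed_values, thresholds):
    """Return (threshold, fraction of annotations with AED <= threshold) pairs."""
    total = len(aed_values)
    if total == 0:
        return []
    return [
        (t, sum(1 for value in aed_values if value <= t) / total)
        for t in thresholds
    ]
```

Two annotation sets (or two releases) can then be compared by plotting their curves together: a curve that rises faster has more annotations in close agreement with the evidence.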

ADD REPLY • link 13.5 years ago by biobot 0.0.77.a.1099 6.2k

Thanks a lot, I will need some time to digest it.

ADD REPLY • link 13.5 years ago by Darked89 4.6k

Michael · Answer 2 · 2010-11-17

There was a thread that talked about this a while back with regard to individual gene models... indeed, you responded to it! (How to compare gene models) So if I understand correctly, you now want to know how to get a higher-level view rather than a per-gene-model comparison?

I spent a bit of time recently looking for software to do this and found little. Consequently, I've been working on a Perl application to compare two sets of annotations. One set is treated as a reference, the other is treated as predictions, and it compares exon structure and coding nucleotide agreement.

It's not ready for prime time yet (there are a few small bugs and it still doesn't handle alternative splicing very well), but I've used it to do some comparisons and it has been very helpful. By default it provides a separate comparison for each gene model, but I should be able to force it to compare the whole sequence all at once (alternative splicing might complicate that, but I may be able to get something to work).
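To illustrate the kind of whole-sequence comparison described above, here is a rough sketch in Python. It is not the Perl application mentioned in this answer. It assumes a single sequence, exons given as 1-based inclusive (start, end) tuples pooled across all gene models, and no special handling of alternative splicing; `compare_annotations` is a hypothetical name.

```python
def compare_annotations(reference_exons, predicted_exons):
    """Exon-level and nucleotide-level agreement for one sequence.

    Sensitivity = shared / reference; specificity = shared / predicted
    (specificity in the gene-prediction sense).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    # Exon level: an exon counts as shared only if both boundaries match.
    ref_exons, pred_exons = set(reference_exons), set(predicted_exons)
    shared_exons = len(ref_exons & pred_exons)

    # Nucleotide level: compare the positions covered by each set.
    ref_nt = {p for s, e in ref_exons for p in range(s, e + 1)}
    pred_nt = {p for s, e in pred_exons for p in range(s, e + 1)}
    shared_nt = len(ref_nt & pred_nt)

    return {
        "exon_sensitivity": ratio(shared_exons, len(ref_exons)),
        "exon_specificity": ratio(shared_exons, len(pred_exons)),
        "missing_exons": len(ref_exons - pred_exons),
        "wrong_exons": len(pred_exons - ref_exons),
        "nt_sensitivity": ratio(shared_nt, len(ref_nt)),
        "nt_specificity": ratio(shared_nt, len(pred_nt)),
    }
```

Reporting both levels side by side is useful because a prediction can recover nearly every coding nucleotide while still getting many exon boundaries wrong.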

Let me know if you would like to talk details.