I have output from several gene prediction programs (using the term loosely):
- de novo predictors (Augustus, GlimmerHMM, Geneid, SNAP, Genscan)
- RNA-Seq reads mapped with TopHat and assembled with Cufflinks
- EST sets mapped with PASA (same species)/GMAP (2M+ plant ESTs)
- 10 protein sets mapped with exonerate
I also have:
- a semi-curated set of 1000 proteins (non-chimeric, non-truncated, with correct size and similarity to other plant proteins, although exon borders may occasionally be wrong or small introns retained), ca. 700 of them unique at the 50% protein similarity level (uclust)
- 400+ CEGMA predictions based on HMM profiles of a conserved set of genes
So far, Augustus with RNA-Seq evidence is well ahead at predicting sensible genes. I have been comparing the numbers of "exons" shared between these sets, and I am puzzled by the large number of exons unique to almost every method. While this would be normal for de novo predictors, I was hoping that the homology-based methods (i.e. exonerate protein-to-genome, GMAP and Cufflinks) would overlap much more. I am going to work on improving the individual programs' results where possible (retraining, better filtering of ESTs/proteins, etc.).
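For reference, the kind of exon comparison I mean can be sketched roughly as below. This is only a minimal illustration, not my exact pipeline: it assumes each method's output is tab-separated GFF/GTF with `exon` features, and it counts two exons as shared only if seqid, start, end and strand match exactly. The function names (`read_exons`, `compare_exon_sets`) and the demo coordinates are made up for the example.

```python
from itertools import combinations


def read_exons(gff_lines, feature_types=("exon",)):
    """Collect exon coordinates as (seqid, start, end, strand) tuples.

    Assumes tab-separated GFF/GTF lines; which feature types count as
    "exons" differs between tools, so feature_types is an assumption.
    """
    exons = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or cols[2] not in feature_types:
            continue  # skip malformed lines and other feature types
        exons.add((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return exons


def compare_exon_sets(sets_by_method):
    """Return (unique, shared): exons unique to each method and
    pairwise counts of exons with identical coordinates."""
    unique = {}
    for name, exons in sets_by_method.items():
        others = set().union(
            *(e for n, e in sets_by_method.items() if n != name)
        )
        unique[name] = len(exons - others)
    shared = {
        (a, b): len(sets_by_method[a] & sets_by_method[b])
        for a, b in combinations(sorted(sets_by_method), 2)
    }
    return unique, shared


# Hypothetical toy input, three methods on one contig.
demo = {
    "augustus": read_exons([
        "chr1\tAUGUSTUS\texon\t100\t200\t.\t+\t.\tID=a1",
        "chr1\tAUGUSTUS\texon\t300\t400\t.\t+\t.\tID=a2",
    ]),
    "exonerate": read_exons([
        "chr1\texonerate\texon\t100\t200\t.\t+\t.\tID=e1",
        "chr1\texonerate\texon\t310\t400\t.\t+\t.\tID=e2",
    ]),
    "cufflinks": read_exons([
        "chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tID=c1",
        "chr1\tCufflinks\texon\t300\t400\t.\t+\t.\tID=c2",
        "chr1\tCufflinks\texon\t500\t650\t.\t+\t.\tID=c3",
    ]),
}
unique, shared = compare_exon_sets(demo)
```

With exact matching like this, an exon whose boundary differs by even a few bases counts as unique, so the strictness of the matching rule is itself a choice that affects the numbers.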
I am looking for some genome-wide measure that tells me how well I am doing, be it for an individual gene prediction program or for a prediction combiner, compared with, say, Arabidopsis and two or three other recently annotated plant genomes. Any ideas?