I am trying to use CONRAD for gene prediction on a novel plant genome. As a training set it asks for at least 200 correct genes, and using more of them should give me more accurate predictions.
Here I run into a chicken-and-egg problem: it is hard to get even 200 reliable genes for a novel plant (NCBI has 500+ protein entries for it, but these are mostly mitochondrial/chloroplast genes). I already have AUGUSTUS-predicted genes, some with 100% RNA-Seq support, but there are at least a few problems with them:
- due to the lack of a training set, AUGUSTUS was run with Arabidopsis gene models
- it is a stretch to call the output of a gene predictor "reliable"
- spliced RNA-Seq mappers and gene predictors cut corners by assuming canonical splice sites
I have ca. 30k ESTs from the plant, but from various strains, plus several lanes of Illumina RNA-Seq, also from various strains. The genomic sequence (mostly 454) is at draft stage, with multiple gaps likely swallowing some exons. There are also a few Sanger-sequenced BACs.
My idea would be to:
- start with the 1000 (2000?) proteins (non-mitochondrial and non-chloroplast) most highly conserved among 5(?) plant species
- filter those against a repeat library (transposons etc.)
- map these to the genome using exonerate
- map all RNA-Seq reads and all ESTs to the regions identified above
- assemble the RNA-Seq reads and ESTs identified in the previous step => cDNAs
- check whether the cDNA translations are sane by comparing them to protein sequences from other species
- align reliable cDNAs to genome
- manually check as many genes from this final set as possible.
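The translation sanity check could be prototyped cheaply before bringing in any cross-species comparison. A minimal sketch in plain Python (standard library only; the six-frame scan, the 300 nt ORF cutoff, and the coverage fraction are my own illustrative assumptions, not anything CONRAD or exonerate requires):

```python
# Crude pre-filter: does a cDNA contain one long ATG..stop ORF covering
# most of its length? cDNAs passing this would still be compared (e.g. by
# BLASTP) against related-species proteins; this only discards obvious junk.

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def longest_orf(seq):
    """Length in nt of the longest ATG..stop ORF over all six frames."""
    best = 0
    for strand in (seq, revcomp(seq)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i                      # first in-frame start
                elif codon in STOPS and start is not None:
                    best = max(best, i + 3 - start)
                    start = None                   # ORF closed by stop
    return best

def looks_coding(cdna, min_orf_nt=300, min_fraction=0.5):
    """Sanity filter: one long ORF covering much of the cDNA (thresholds
    are illustrative assumptions, tune them for your transcripts)."""
    orf = longest_orf(cdna.upper())
    return orf >= min_orf_nt and orf / len(cdna) >= min_fraction
```

This would run in seconds on 30k ESTs; anything it rejects is either non-coding, a chimeric assembly, or frameshifted by an indel, all of which you want out before training CONRAD anyway.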
My questions:
- is it possible to speed up the whole procedure?
- going in the opposite direction: how can I improve it / make the training set more reliable?