Question

Gene Set(s) for Gene Predictor Training

8

13.8 years ago

Darked89 4.6k

I am trying to use CONRAD for gene prediction on a novel plant genome sequence. It asks for a training set of at least 200 correct genes, but using more of them should give me more accurate predictions.

Here I run into a chicken-and-egg problem: it is hard to get even 200 reliable genes for a novel plant (NCBI has 500+ protein entries, but these are mostly mitochondrial / chloroplast genes). I already have AUGUSTUS-predicted genes, some with 100% RNA-Seq support, but there are at least a few problems with them:

due to the lack of a training set, AUGUSTUS was run with Arabidopsis gene models
it is a stretch to call an output of a gene predictor "reliable"
spliced RNA-Seq mappers and gene predictors cut corners using canonical splice sites
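One way to see how much the canonical-splice-site shortcut matters for a candidate gene set is to tabulate the dinucleotides at the ends of each predicted intron. The sketch below is only an illustration: `classify_intron`, its 0-based half-open coordinates, and the class labels are assumptions made for this example, not part of AUGUSTUS or any RNA-Seq mapper.

```python
# Illustrative sketch: function names and the coordinate convention
# (0-based, half-open, relative to the forward genome strand) are assumptions.

# GT-AG is the canonical pair; GC-AG and AT-AC are rarer but well-known classes.
KNOWN_CLASSES = {
    ("GT", "AG"): "GT-AG",
    ("GC", "AG"): "GC-AG",
    ("AT", "AC"): "AT-AC",
}

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_intron(genome_seq, start, end, strand="+"):
    """Classify the intron genome_seq[start:end] by its donor/acceptor dinucleotides."""
    intron = genome_seq[start:end].upper()
    if len(intron) < 4:
        raise ValueError("intron too short to carry both splice sites")
    if strand == "-":
        # For minus-strand genes, read the intron in transcript orientation.
        intron = reverse_complement(intron)
    donor, acceptor = intron[:2], intron[-2:]
    return KNOWN_CLASSES.get((donor, acceptor), "non-canonical")
```

Introns that come back as "non-canonical" are not necessarily wrong, but they are the ones most worth checking by hand against EST or BAC evidence.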

I have ca. 30k ESTs from the plant, but from various strains, plus several lanes of Illumina RNA-Seq, also from various strains. The genomic sequence (mostly 454) is at draft stage, with multiple gaps likely swallowing some exons. There are also a few Sanger-sequenced BACs.

My idea would be to:

start with 1000 (2000?) (non-mitochondrial and non-chloroplast) proteins most highly conserved among 5(?) plant species
filter those against repeat library (transposons etc.),
map these to genome using exonerate
map all RNA-Seq and all ESTs to regions identified above
assemble the RNA-Seq reads and ESTs identified in the previous step => cDNAs
check if cDNAs translations are sane by comparing them to protein sequences from other species
align reliable cDNAs to genome
manually check as many genes from this final set as possible.
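Before the cross-species protein comparison, a cheap pre-filter for the "are the cDNA translations sane" step could be to require a reasonably long open reading frame. The following is a minimal sketch under stated assumptions: it uses the standard nuclear genetic code, scans all six frames because the assembled cDNAs may be in either orientation, and the `min_aa` threshold of 100 is an arbitrary placeholder. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built in the usual TCAG ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(_BASES, repeat=3), _AMINO)}

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq):
    """Translate in frame 0; codons with ambiguous bases become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def longest_orf(cdna):
    """Return the longest peptide starting at M with no internal stop, over six frames.

    A peptide that runs off the end of the sequence without a stop codon is
    still accepted, since assembled cDNAs are often partial.
    """
    best = ""
    for strand_seq in (cdna, reverse_complement(cdna)):
        for frame in range(3):
            for segment in translate(strand_seq[frame:]).split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best):
                    best = segment[start:]
    return best


def has_plausible_orf(cdna, min_aa=100):
    """Crude sanity check; min_aa is a placeholder threshold, not a recommendation."""
    return len(longest_orf(cdna)) >= min_aa
```

cDNAs that pass this filter would still need the comparison against proteins from other species; the filter only removes assemblies that cannot encode a sizeable protein at all.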

My questions:

is it possible to speed up the whole procedure?
going in the opposite direction: how to improve it/make it better?

gene next-gen sequencing • 4.6k views

updated 5.6 years ago by Ram 43k • written 13.8 years ago by Darked89 4.6k

2

Novel plant genome? How novel are we talking about? The odds are there are some closely related species whose cDNAs you can map with GMAP. Also try a couple of ab initio prediction programs and use a combiner to get the consensus.

13.7 years ago by Haibao Tang 3.0k

0

I think your approach looks very well thought out already. Mapping transcripts (RNA-Seq) is considered state-of-the-art in gene-structure prediction. Just remember that full-length(?) cDNA was used for training in the Medicago truncatula gene annotation. Possible improvement: there are many more eukaryotic gene prediction tools available (e.g. EUGENE); I can post a list if you like.

13.8 years ago by Michael 54k

0

It is from the amaranth family: not that novel/strange. Tblastn of the AUGUSTUS predictions picks up sensible ESTs from multiple species for parts of proteins not recognized by blastp. I am a bit concerned that almost everything that GMAP can spot at the nucleotide level will already be detected by exonerate using protein2genome. But I will check GMAP with ESTs from other species.

13.7 years ago by Darked89 4.6k

score 3 · Answer 1 · 2010-08-13

We grappled with similar issues on the Arabidopsis genome project, albeit with different software: GenScan, MZEF, FGenesH, GRAIL, etc. One of the biggest problems for gene prediction algorithms is the 2-exon gene and its lack of non-terminal or doubly-spliced exons. So, make sure you have a fair number of these genes in your training set. In fact, I would have a distribution of genes composed of 1, 2, 3 and 4+ exons that rather closely matches what you see from a similar plant - monocot or dicot at the very least.
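The exon-count matching suggested above can be sketched as a stratified sample. This is only an illustration: the function names, the `{gene_id: n_exons}` input format, and especially the target fractions in the usage note are hypothetical placeholders, and real fractions would have to be measured from the annotation of a related plant.

```python
import random
from collections import defaultdict


def exon_bin(n_exons):
    """Map an exon count (>= 1) to one of the bins '1', '2', '3', '4+'."""
    return "4+" if n_exons >= 4 else str(n_exons)


def sample_training_set(exon_counts, target_fractions, size, seed=0):
    """Pick about `size` genes so the exon-count bins follow target_fractions.

    exon_counts:      {gene_id: number_of_exons} for the candidate genes
    target_fractions: {"1": f1, "2": f2, "3": f3, "4+": f4}, summing to about 1
    If a bin has too few candidates, all of them are taken and the result
    is smaller than `size`.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_bin = defaultdict(list)
    for gene_id, n_exons in sorted(exon_counts.items()):
        by_bin[exon_bin(n_exons)].append(gene_id)

    chosen = []
    for bin_name, fraction in target_fractions.items():
        wanted = round(fraction * size)
        pool = by_bin.get(bin_name, [])
        chosen.extend(rng.sample(pool, min(wanted, len(pool))))
    return chosen
```

For example, `sample_training_set(counts, {"1": 0.2, "2": 0.2, "3": 0.2, "4+": 0.4}, size=200)` would draw 40 single-exon, 40 two-exon, 40 three-exon and 80 longer genes, provided enough candidates exist in each bin. The fractions here are made up for the example.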

Mitochondrial or chloroplast proteins are fine to include, but do not include genes encoded by the genomes of those organelles.

I found very good concordance between A. thaliana exon sizes and splice sites and those from soy (Glycine max). Thus, your alignment ideas seem reasonable.

I would not worry so much about speeding things up because taking the time now to build a good training set will be well worth the effort when you have a better predictor.