Question: Gene Set(S) For Gene Predictor Training
10.7 years ago by
Barcelona, Spain
Darked89 (4.2k) wrote:

I am trying to use CONRAD for gene prediction on a novel plant genome sequence. For training it asks for at least 200 correct genes, and a larger set should give more accurate predictions.

Here I run into a chicken-and-egg problem: it is hard to get even 200 reliable genes for a novel plant (NCBI has 500+ protein entries, but these are mostly mitochondrial/chloroplast genes). I already have AUGUSTUS-predicted genes, some with 100% RNA-Seq support, but there are at least a few problems with them:

  • due to the lack of a training set, AUGUSTUS was run with Arabidopsis gene models
  • it is a stretch to call the output of a gene predictor "reliable"
  • spliced RNA-Seq mappers and gene predictors cut corners by assuming canonical splice sites
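
One way to see how much the canonical-splice-site assumption matters for a candidate gene model is to read the intron boundary dinucleotides directly from the genomic sequence. The following is a minimal sketch, not part of any tool mentioned here: the function names are hypothetical, coordinates are assumed to be 1-based inclusive, and only forward-strand models are handled (reverse-strand models would need to be reverse-complemented first).

```python
def intron_splice_sites(seq, exons):
    """Return (donor, acceptor) dinucleotides for each intron.

    seq   -- genomic sequence (forward strand) as a string
    exons -- list of (start, end) tuples, 1-based inclusive, sorted by start
    """
    seq = seq.upper()
    sites = []
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        # The intron covers positions end1+1 .. start2-1 (1-based),
        # which is the Python slice seq[end1:start2 - 1].
        intron = seq[end1:start2 - 1]
        sites.append((intron[:2], intron[-2:]))
    return sites


def non_canonical_introns(seq, exons):
    """Return the splice-site pairs that are not the canonical GT-AG."""
    return [s for s in intron_splice_sites(seq, exons) if s != ("GT", "AG")]


# Toy example: two exons separated by an 8 bp intron.
genome = "ATGAAA" + "GTCCCCAG" + "TTTTAA"
model = [(1, 6), (15, 20)]
print(intron_splice_sites(genome, model))
print(non_canonical_introns(genome, model))
```

Models whose introns are all GT-AG are not thereby proven correct; the check only flags candidates that deserve a closer manual look.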

I have ca. 30k ESTs from the plant, but from various strains, and several lanes of Illumina RNA-Seq, also from various strains. The genomic sequence (mostly 454) is at draft stage, with multiple gaps that likely swallow some exons. There are also a few Sanger-sequenced BACs.

My idea would be to:

  • start with the 1000 (2000?) proteins (non-mitochondrial and non-chloroplast) most highly conserved among 5(?) plant species
  • filter those against a repeat library (transposons etc.)
  • map these to the genome using exonerate
  • map all RNA-Seq reads and all ESTs to the regions identified above
  • assemble the RNA-Seq reads and ESTs identified in the previous step => cDNAs
  • check whether the cDNA translations are sane by comparing them to protein sequences from other species
  • align the reliable cDNAs to the genome
  • manually check as many genes from this final set as possible
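
Before comparing cDNA translations to proteins from other species, a cheap structural pre-filter can discard obviously broken coding sequences. The sketch below is only an illustration under stated assumptions: `cds_problems` is a hypothetical helper, the input is assumed to be a complete nuclear CDS on the coding strand, and the standard genetic code is assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def cds_problems(cds):
    """Return a list of structural problems found in a coding sequence.

    An empty list means the CDS starts with ATG, ends with a stop codon,
    has a length divisible by 3 and contains no in-frame internal stop.
    """
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("no ATG start codon")
    # Split into complete codons only; a trailing partial codon is ignored.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop codon")
    if any(c in STOP_CODONS for c in codons[:-1]):
        problems.append("in-frame internal stop codon")
    return problems


print(cds_problems("ATGAAATTTTAA"))
print(cds_problems("ATGTAAAAATAG"))
```

Sequences that pass would still go through the protein comparison and the manual check; this only reduces the number of candidates that reach those slower steps.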

My questions:

  • is it possible to speed up the whole procedure?
  • going in the opposite direction: how can I improve it?
modified 2.5 years ago by Ram (32k) • written 10.7 years ago by Darked89 (4.2k)

Novel plant genome? How novel are we talking about? Odds are there are some closely related species whose cDNAs you can map with GMAP. Also try a couple of ab initio prediction programs and use a combiner to get the consensus.

written 10.5 years ago by Haibao Tang (3.0k)

I think your approach looks very well thought out already. Mapping transcripts (RNA-Seq) is considered state of the art in gene-structure prediction. Just remember that full-length(?) cDNA was used for training in the Medicago truncatula gene annotation. Possible improvement: there are many more eukaryotic gene prediction tools available (e.g. EUGENE); I can post a list if you like.

written 10.7 years ago by Michael Dondrup (48k)

It is from the amaranth family: not that novel/strange. Tblastn of the AUGUSTUS predictions picks up sensible ESTs from multiple species for parts of proteins not recognized by blastp. I am a bit concerned that almost everything GMAP can spot at the nucleotide level will already be detected by exonerate using protein2genome. But I will check GMAP with ESTs from other species.

written 10.5 years ago by Darked89 (4.2k)
10.5 years ago by
Boston, MA USA
Larry_Parnell (16k) wrote:

We grappled with similar issues on the Arabidopsis genome project, albeit with different software: GenScan, MZEF, FGenesH, GRAIL, etc. One of the biggest problems for gene prediction algorithms is the 2-exon gene and its lack of non-terminal or doubly-spliced exons. So, make sure you have a fair number of these genes in your training set. In fact, I would have a distribution of genes composed of 1, 2, 3 and 4+ exons that rather closely matches what you see from a similar plant - monocot or dicot at the very least.
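
As a rough illustration of matching the exon-count distribution, the sketch below counts exons per transcript from GFF3 lines, bins them into 1, 2, 3 and 4+ exons, and draws a training set whose bin proportions follow a target taken from a reference species. All function names are hypothetical, and it assumes that each `exon` feature carries a `Parent` attribute naming its transcript, which not every GFF3 file does.

```python
import random
from collections import Counter, defaultdict

BINS = ("1", "2", "3", "4+")


def exon_bin(n):
    """Map an exon count to one of the bins 1, 2, 3, 4+."""
    return "4+" if n >= 4 else str(n)


def exon_counts_from_gff3(lines):
    """Count 'exon' features per parent transcript in GFF3 lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                counts[parent] += 1
    return counts


def bin_fractions(counts):
    """Fraction of transcripts falling into each exon-count bin."""
    bins = Counter(exon_bin(n) for n in counts.values())
    total = sum(bins.values())
    return {b: (bins[b] / total if total else 0.0) for b in BINS}


def sample_matching(counts, target, size, seed=0):
    """Pick up to `size` transcripts so that bin proportions follow `target`.

    If a bin has too few candidates, all of them are taken and the
    result is smaller than `size`.
    """
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for tid, n in sorted(counts.items()):
        by_bin[exon_bin(n)].append(tid)
    chosen = []
    for b, frac in target.items():
        pool = by_bin.get(b, [])
        chosen.extend(rng.sample(pool, min(round(frac * size), len(pool))))
    return chosen
```

A possible use: compute `bin_fractions` on the annotation of a related, well-annotated plant, then pass the result as `target` to `sample_matching` on the candidate gene set. A shortfall in the two-exon bin would show up as a result smaller than requested, which is a hint to look for more such genes.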

Mitochondrial or chloroplast proteins are fine to include, but do not include genes encoded by the genomes of those organelles.

I found very good concordance between A. thaliana exon sizes and splice sites and those from soy (Glycine max). Thus, your alignment ideas seem reasonable.

I would not worry so much about speeding things up because taking the time now to build a good training set will be well worth the effort when you have a better predictor.

written 10.5 years ago by Larry_Parnell (16k)