Question: Gene Set(S) For Gene Predictor Training
Darked89 (Barcelona, Spain) wrote 9.2 years ago:

I am trying to use CONRAD for gene prediction on a novel plant genome sequence. It asks for a training set of at least 200 correct genes, but using more should give me more accurate predictions.

Here I run into a chicken-and-egg problem: it is hard to get even 200 reliable genes for a novel plant (NCBI has 500+ protein entries, but these are mostly mitochondrial / chloroplast genes). I already have AUGUSTUS-predicted genes, some with 100% RNA-Seq support, but there are at least a few problems with them:

  • due to the lack of a training set, AUGUSTUS was run with Arabidopsis gene models
  • it is a stretch to call an output of a gene predictor "reliable"
  • spliced RNA-Seq mappers and gene predictors cut corners using canonical splice sites
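One way to see how much the third point matters for a given set of gene models is to tally the intron boundary dinucleotides. The following is a minimal sketch, assuming the genome sequence is already in memory as a string and exons are given as 1-based inclusive forward-strand coordinates; the function names are hypothetical and not part of any tool mentioned here.

```python
from collections import Counter

def revcomp(seq):
    """Reverse-complement a DNA string; unknown bases become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))

def intron_motifs(genome_seq, exons, strand="+"):
    """Return 'donor-acceptor' dinucleotide pairs, one per intron.

    exons: list of (start, end) tuples, 1-based inclusive,
           in forward-strand coordinates (assumed input layout).
    """
    exons = sorted(exons)
    motifs = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        # Intron spans 1-based positions left_end+1 .. right_start-1.
        intron = genome_seq[left_end:right_start - 1].upper()
        if len(intron) < 4:
            continue  # too short to carry both splice-site dinucleotides
        if strand == "-":
            intron = revcomp(intron)
        motifs.append(intron[:2] + "-" + intron[-2:])
    return motifs

# Toy example: two exons separated by a GT...AG intron.
genome = "ATGAAA" + "GTAAGTTTTCAG" + "CCCTAA"
print(Counter(intron_motifs(genome, [(1, 6), (19, 24)])))
```

A high fraction of GT-AG here does not prove the models are right, since the tools that produced them may have favoured canonical sites in the first place.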

I have ca. 30k ESTs from the plant, but from various strains, and several lanes of Illumina RNA-Seq, also from various strains. The genomic sequence (mostly 454) is at the draft stage, with multiple gaps likely swallowing some exons. There are also a few Sanger-sequenced BACs.

My idea would be to:

  • start with 1000 (2000?) (non-mitochondrial and non-chloroplast) proteins most highly conserved among 5(?) plant species
  • filter those against a repeat library (transposons etc.)
  • map these to the genome using exonerate
  • map all RNA-Seq reads and all ESTs to the regions identified above
  • assemble the RNA-Seq reads and ESTs identified in the previous step into cDNAs
  • check whether the cDNA translations are sane by comparing them to protein sequences from other species
  • align the reliable cDNAs to the genome
  • manually check as many genes from this final set as possible.
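Before comparing translations to proteins from other species, a cheap structural pre-filter can discard obviously broken candidates. The sketch below is only an illustration, assuming each candidate coding sequence has already been extracted as a plain DNA string in the sense orientation and that the standard nuclear genetic code applies; `cds_problems` is a hypothetical helper name.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard nuclear stop codons (assumption)

def cds_problems(cds):
    """Return the reasons a candidate CDS looks suspicious; empty list means it passes."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Split into complete codons only; a trailing partial codon is ignored.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("no ATG start codon")
    if not codons or codons[-1] not in STOPS:
        problems.append("no terminal stop codon")
    if any(codon in STOPS for codon in codons[:-1]):
        problems.append("in-frame internal stop codon")
    return problems

for name, seq in [("ok", "ATGAAATAA"), ("broken", "ATGTAAAAATAA")]:
    print(name, cds_problems(seq))
```

Candidates that pass this check still need the cross-species comparison and manual inspection from the list above; the filter only reduces how many have to be looked at.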

My questions:

  • is it possible to speed up the whole procedure?
  • going in the opposite direction: how to improve it/make it better?
Tags: sequencing, next-gen, gene. Written 9.2 years ago by Darked89; modified 12 months ago by RamRS.

Novel plant genome? How novel are we talking about? Odds are there are some closely related species whose cDNAs you can map with GMAP. Also try a couple of ab initio prediction programs and use a combiner to get the consensus.

Reply written 9.1 years ago by Haibao Tang

I think your approach looks very well thought out already. Mapping transcripts (RNA-Seq) is considered state of the art in gene-structure prediction. Just remember that full-length(?) cDNA was used for training in the Medicago truncatula gene annotation. Possible improvement: there are many more eukaryotic gene prediction tools on the market (e.g. EUGENE); I can post a list if you like.

Reply written 9.2 years ago by Michael Dondrup

It is from the amaranth family: not that novel/strange. A tblastn of the AUGUSTUS predictions picks sensible ESTs from multiple species for parts of proteins not recognized by blastp. I am a bit concerned that almost everything GMAP can spot at the nucleotide level will already be detected by exonerate using protein2genome. But I will check GMAP with ESTs from other species.

Reply written 9.1 years ago by Darked89
Larry_Parnell (Boston, MA, USA) wrote 9.1 years ago:

We grappled with similar issues on the Arabidopsis genome project, albeit with different software: GenScan, MZEF, FGenesH, GRAIL, etc. One of the biggest problems for gene prediction algorithms is the 2-exon gene and its lack of non-terminal or doubly-spliced exons. So, make sure you have a fair number of these genes in your training set. In fact, I would have a distribution of genes composed of 1, 2, 3 and 4+ exons that rather closely matches what you see from a similar plant - monocot or dicot at the very least.
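The exon-count matching described above could be sketched roughly as follows. This is an illustration only: it assumes exon counts per candidate gene have already been computed (for example from an annotation file), and the target fractions in the usage example are made-up placeholders, not measured values from Arabidopsis or any other plant. The function names are hypothetical.

```python
import random
from collections import defaultdict

def exon_bin(n_exons):
    """Map an exon count to one of the bins '1', '2', '3', '4+'."""
    return "4+" if n_exons >= 4 else str(n_exons)

def pick_training_genes(exon_counts, target_fractions, n_total, seed=0):
    """Sample genes so the exon-count bins approximate target_fractions.

    exon_counts: dict of gene_id -> number of exons
    target_fractions: dict of bin label -> desired fraction of n_total
    """
    by_bin = defaultdict(list)
    for gene_id, n_exons in sorted(exon_counts.items()):
        by_bin[exon_bin(n_exons)].append(gene_id)
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    chosen = []
    for label, fraction in target_fractions.items():
        wanted = round(fraction * n_total)
        pool = by_bin.get(label, [])
        # If a bin has too few candidates, take all of them.
        chosen.extend(rng.sample(pool, min(wanted, len(pool))))
    return chosen

# Placeholder data: 10 candidate genes for each exon count from 1 to 5.
candidates = {f"g{n}_{i}": n for n in range(1, 6) for i in range(10)}
targets = {"1": 0.2, "2": 0.2, "3": 0.2, "4+": 0.4}  # hypothetical fractions
picked = pick_training_genes(candidates, targets, n_total=10)
print(sorted(picked))
```

If a bin runs short, as the two-exon bin might, that is a signal to go back and curate more genes of that type instead of silently accepting a skewed set.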

Mitochondrial or chloroplast proteins are fine to include, but do not include genes encoded by the genomes of those organelles.

I found very good concordance between A. thaliana exon sizes and splice sites and those from soy (Glycine max). Thus, your alignment ideas seem reasonable.

I would not worry so much about speeding things up because taking the time now to build a good training set will be well worth the effort when you have a better predictor.
