Tutorial: gene set filter/selection for training ab initio annotation tools
3
gravatar for Juke-34
6 days ago by
Juke-342.4k
Sweden
Juke-342.4k wrote:

Here I show how to filter an annotation (often evidence-based) to create a golden set of genes in order to train ab-initio annotation tools.

Prerequisite:

  • an annotation in gff or gtf
  • the GAAS toolkit
  • blast
  • to make it easier a nice folder structure:
    • mkdir filter
    • mkdir protein
    • mkdir nonredundant
    • mkdir blast_recursive

1. Select protein coding genes
First we select only the protein coding genes from the annotation file and remove all other type of level2 features: tRNA, snoRNA, etc.

gff3_sp_splitByLevel2Feature.pl -g maker_evidence.gff -o splitByLevel2  
ln -s splitByLevel2/mrna.gff

2. Optional (Only if Maker annotation): filter by score
Next step, we need to filter the best genes. We will keep gene with AED score under a certain score (e.g 0.3).

maker_select_models_by_AED_score.pl -f mrna.gff -v 0.3 -t "<=" -o filter/codingGeneFeatures.filter.gff

3. Longest isoform per mRNA
Now we filter mRNAs to keep only the longest isoform when several are avaialble.

gff3_sp_keep_longest_isoform.pl -f filter/codingGeneFeatures.filter.gff -o filter/codingGeneFeatures.filter.longest_cds.gff

4. Complete genes
We then check that the gene models are complete (has a start and stop codon), and remove the incomplete ones.

gff3_sp_filter_incomplete_gene_coding_models.pl --gff filter/codingGeneFeatures.filter.longest_cds.gff -f genome.fa -o filter/codingGeneFeatures.filter.longest_cds.complete.gff

4. Distance from neighboring genes
We can also filter the gene models by distance from neighboring genes in order to be sure to have nice intergenic region wihtout genes in it in order to train properly intergenic parameters. (default 500 bp)

gff3_sp_filter_by_locus_distance.pl --gff filter/codingGeneFeatures.filter.longest_cds.complete.gff -o filter/codingGeneFeatures.filter.longest_cds.complete.good_distance.gff

5. Remove redundancy
To avoid to bias the training and give an exhaustive view of the diversity of gene we have to remove the ones that are too similar to each other. In order to do so, we translate our coding genes into proteins, format the protein fasta file to be able to run a recursive blast and then filter them.

gff3_sp_extract_sequences.pl -g filtercodingGeneFeatures.filter.longest_cds.complete.good_distance.gff -f genome.fa -o protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa  

makeblastdb -in protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -dbtype prot   

blastp -query protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -db 
protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -outfmt 6 -out blast_recursive/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa.blast_recursive  

gff3_sp_filter_by_mrnaBlastValue_bioperl.pl --gff filter/codingGeneFeatures.filter.longest_cds.complete.good_distance.gff --blast blast_recursive/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa.blast_recursive --outfile nonredundant/codingGeneFeatures.nr.gff

Voila, nonredundant/codingGeneFeatures.nr.gff contains your nice gene set. You need several hundred genes to train properly an ab initio tool. The more you have, the better is... but you kind of reach a plateau around 700 genes.

tutorial gene genome • 60 views
ADD COMMENTlink written 6 days ago by Juke-342.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1105 users visited in the last hour