Tutorial: gene set filter/selection for training ab initio annotation tools
6
gravatar for Juke34
9 months ago by
Juke344.1k
Sweden
Juke344.1k wrote:

Here I show how to filter an annotation (often evidence-based) to create a golden set of genes in order to train ab-initio annotation tools.

Prerequisite:

  • an annotation in gff or gtf
  • AGAT (Another Gff Analysis Toolkit)
  • blast
  • to make it easier a nice folder structure:
    • mkdir filter
    • mkdir protein
    • mkdir nonredundant
    • mkdir blast_recursive

1. Select protein coding genes
First we select only the protein coding genes from the annotation file and remove all other type of level2 features: tRNA, snoRNA, etc.

agat_sp_split_by_level2_feature.pl --gff annotation_evidence_based.gff -o splitByLevel2  
ln -s splitByLevel2/mrna.gff

2. Optional (Only if Maker annotation): filter by score
Next step, we need to filter the best genes. We will keep gene with AED score under a certain score (e.g 0.3).

agat_sp_filter_feature_by_attribute_value.pl --gff mrna.gff  --value 0.3 -a _AED -t ">=" -o filter/codingGeneFeatures.filter.gff

3. Longest isoform per mRNA
Now we filter mRNAs to keep only the longest isoform when several are avaialble.

agat_sp_keep_longest_isoform.pl --gff filter/codingGeneFeatures.filter.gff -o filter/codingGeneFeatures.filter.longest_cds.gff

4. Complete genes
We then check that the gene models are complete (has a start and stop codon), and remove the incomplete ones.

agat_sp_filter_incomplete_gene_coding_models.pl --gff filter/codingGeneFeatures.filter.longest_cds.gff -f genome.fa -o filter/codingGeneFeatures.filter.longest_cds.complete.gff

4. Distance from neighboring genes
We can also filter the gene models by distance from neighboring genes in order to be sure to have nice intergenic region wihtout genes in it in order to train properly intergenic parameters. (default 500 bp)

agat_sp_filter_by_locus_distance.pl --gff filter/codingGeneFeatures.filter.longest_cds.complete.gff -o filter/codingGeneFeatures.filter.longest_cds.complete.good_distance.gff

5. Remove redundancy
To avoid to bias the training and give an exhaustive view of the diversity of gene we have to remove the ones that are too similar to each other. In order to do so, we translate our coding genes into proteins, format the protein fasta file to be able to run a recursive blast and then filter them.

agat_sp_extract_sequences.pl --gff filtercodingGeneFeatures.filter.longest_cds.complete.good_distance.gff -f genome.fa --aa -o protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa  

makeblastdb -in protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -dbtype prot   

blastp -query protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -db 
protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -outfmt 6 -out blast_recursive/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa.blast_recursive  

agat_sp_filter_by_mrnaBlastValue.pl --gff filter/codingGeneFeatures.filter.longest_cds.complete.good_distance.gff --blast blast_recursive/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa.blast_recursive --outfile nonredundant/codingGeneFeatures.nr.gff

Voila, nonredundant/codingGeneFeatures.nr.gff contains your nice gene set. You need several hundred genes to train properly an ab initio tool. The more you have, the better is... but you kind of reach a plateau around 700 genes.

tutorial gene genome • 882 views
ADD COMMENTlink modified 9 weeks ago • written 9 months ago by Juke344.1k
1

Hi Jacques, Thank you for sharing your protocol. I just wanted to check with you step 5. Should you not request agat_sp_extract_sequences.pl to translate using the --aa switch? Thanks in advance

ADD REPLYlink written 9 weeks ago by victaphanta10

Yes you're right, thank you for pointing it.

ADD REPLYlink written 9 weeks ago by Juke344.1k

Hi, I have the below GFF3 file and I would like to remove mRNAs which do not have a description i.e. do not contain Note. Does GAAS remove those mRNAs? If not then do you know any other tool?

NbV1Ch08        AUGUSTUS        gene    60876   63944   0.03    +       .       ID=g2
NbV1Ch08        AUGUSTUS        mRNA    60876   63944   0.03    +       .       ID=g2.t1;Note=B3 domain-containing protein Os03g0120900;Parent=g2
NbV1Ch08        AUGUSTUS        transcription_start_site        60876   60876   .       +       .       Parent=g2.t1
NbV1Ch08        AUGUSTUS        five_prime_utr  60876   61072   0.19    +       .       ID=g2.t1.5UTR1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    60876   61072   .       +       .       ID=g2.t1.exon1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        five_prime_utr  61673   61732   0.37    +       .       ID=g2.t1.5UTR2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    61673   63449   .       +       .       ID=g2.t1.exon2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        start_codon     61733   61735   .       +       0       Parent=g2.t1
NbV1Ch08        AUGUSTUS        CDS     61733   62974   0.54    +       0       ID=g2.t1.CDS1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        stop_codon      62972   62974   .       +       0       Parent=g2.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 62975   63449   1       +       .       ID=g2.t1.3UTR1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 63565   63944   0.27    +       .       ID=g2.t1.3UTR2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    63565   63944   .       +       .       ID=g2.t1.exon3;Parent=g2.t1
NbV1Ch08        AUGUSTUS        transcription_end_site  63944   63944   .       +       .       Parent=g2.t1
NbV1Ch08        AUGUSTUS        gene    64722   65524   0.32    -       .       ID=g3
NbV1Ch08        AUGUSTUS        mRNA    64722   65524   0.32    -       .       ID=g3.t1;Parent=g3
NbV1Ch08        AUGUSTUS        transcription_end_site  64722   64722   .       -       .       Parent=g3.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 64722   64792   0.77    -       .       ID=g3.t1.3UTR1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        exon    64722   65524   .       -       .       ID=g3.t1.exon1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        stop_codon      64793   64795   .       -       0       Parent=g3.t1
NbV1Ch08        AUGUSTUS        CDS     64793   65494   0.44    -       0       ID=g3.t1.CDS1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        start_codon     65492   65494   .       -       0       Parent=g3.t1

Thank you in advance.

ADD REPLYlink written 9 months ago by Ric290

I think I have not develop any script to perform such filtering within the GAAS toolkit, because a simple awk command can do the trick:

awk '{if(!($3=="mRNA"  && $9 ~ "Note"))print $0}' file.gff > filtered_file.gff
ADD REPLYlink written 9 months ago by Juke344.1k

Thank you. However, using the above command will ignore the below sub features:

transcription_start_site
five_prime_utr
exon
five_prime_utr
exon
start_codon
CDS
stop_codon
three_prime_utr
three_prime_utr
exon
transcription_end_site
ADD REPLYlink written 9 months ago by Ric290

What do you mean? Those features should still be in the output.

ADD REPLYlink written 9 months ago by Juke344.1k

I would like to keep those features because they mRNA contain Note (writing into keep.gff3 file)

NbV1Ch08        AUGUSTUS        gene    60876   63944   0.03    +       .       ID=g2
NbV1Ch08        AUGUSTUS        mRNA    60876   63944   0.03    +       .       ID=g2.t1;Note=B3 domain-containing protein Os03g0120900;Parent=g2
NbV1Ch08        AUGUSTUS        transcription_start_site        60876   60876   .       +       .       Parent=g2.t1
NbV1Ch08        AUGUSTUS        five_prime_utr  60876   61072   0.19    +       .       ID=g2.t1.5UTR1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    60876   61072   .       +       .       ID=g2.t1.exon1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        five_prime_utr  61673   61732   0.37    +       .       ID=g2.t1.5UTR2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    61673   63449   .       +       .       ID=g2.t1.exon2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        start_codon     61733   61735   .       +       0       Parent=g2.t1
NbV1Ch08        AUGUSTUS        CDS     61733   62974   0.54    +       0       ID=g2.t1.CDS1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        stop_codon      62972   62974   .       +       0       Parent=g2.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 62975   63449   1       +       .       ID=g2.t1.3UTR1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 63565   63944   0.27    +       .       ID=g2.t1.3UTR2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    63565   63944   .       +       .       ID=g2.t1.exon3;Parent=g2.t1
NbV1Ch08        AUGUSTUS        transcription_end_site  63944   63944   .       +       .       Parent=g2.t1

On the other hand, I would like to reject those features because mRNA does not contain Note (writing into reject.gff3 file)

NbV1Ch08        AUGUSTUS        gene    64722   65524   0.32    -       .       ID=g3
NbV1Ch08        AUGUSTUS        mRNA    64722   65524   0.32    -       .       ID=g3.t1;Parent=g3
NbV1Ch08        AUGUSTUS        transcription_end_site  64722   64722   .       -       .       Parent=g3.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 64722   64792   0.77    -       .       ID=g3.t1.3UTR1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        exon    64722   65524   .       -       .       ID=g3.t1.exon1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        stop_codon      64793   64795   .       -       0       Parent=g3.t1
NbV1Ch08        AUGUSTUS        CDS     64793   65494   0.44    -       0       ID=g3.t1.CDS1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        start_codon     65492   65494   .       -       0       Parent=g3.t1
ADD REPLYlink written 9 months ago by Ric290

Ok I understand, then awk is not well suited. Unfortunalty I do not have script yet to perform this task.

ADD REPLYlink written 9 months ago by Juke344.1k

It seems to work for now after sed 's|,AT.*-Protein;||g' TAIR10_GFF3_genes-fix1.rm_rubbish_rm_protein.gff > TAIR10_GFF3_genes-fix1.rm_rubbish_rm_protein_rm_id.gff.

ADD REPLYlink written 9 months ago by Ric290

It seems you mixed up your answer with that post isn't it?

ADD REPLYlink written 9 months ago by Juke344.1k

Thank you for noticing.

ADD REPLYlink written 9 months ago by Ric290

I have now a script to perform this task properly in AGAT called ‘agat_sp_filter_feature_by_attribute_presence.pl’

ADD REPLYlink written 4 months ago by Juke344.1k

Hi Thanks for your guide !

I am working to get a nice set of gene after my 1st round of maker. In that case, what is annotation_evidence_based.gff ?

ADD REPLYlink written 5 months ago by Picasa530
2

It's your output of maker after using:

...../src/maker/bin/gff3_merge -d your_name_master_datastore_index.log

He called it evidence_based because your first maker run should be using protein and RNA (ESTs) evidence to detect genes. The next maker round you do will be without these evidences (you should disable them) and only with the model .hmm file that you create with SNAP or Augustus.

Cheers,

ADD REPLYlink written 5 months ago by ricardoguerreiro212160

I didn't even realize this was necessary. The maker tutorials don't mention it at all, they just proceed to SNAP modelling with all genes, using only "fathom" as quality control. Does this make a very big difference in results? Where can I learn about it?

Cheers,
P.S: Thank you so much for sharing your knowledge so well!

ADD REPLYlink modified 5 months ago • written 5 months ago by ricardoguerreiro212160
1

By experience the approach I mention here give quite similar results (at least for Augustus), excepted it needs only one round of MAKER, that make the approach much more efficient in term of time.

I have seen you message on the MAKER mailing list, you can use alt_ESTs only but not with est2genome=1 because no gene prediction are made with alt_ESTs (it is not obvious but written somewhere in one of their paper). What you can do is to use alt_EST data as EST too. I often do that, it might help if it is not too diverged.But you will not have many prediction ET-based like that.

ADD REPLYlink modified 4 months ago • written 4 months ago by Juke344.1k
0
gravatar for Juke34
9 weeks ago by
Juke344.1k
Sweden
Juke344.1k wrote:

For whom interested we also have a nextflow pipeline for it that goes even beyond ( does the training too ).
You can find it here: https://github.com/NBISweden/pipelines-nextflow/tree/master/AbinitioTraining

ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by Juke344.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 739 users visited in the last hour