Question

Continuing issues with MAKER and SNAP

0

2.2 years ago

mrmrwinter ▴ 30

Hi,

I am trying to annotate my assembly using MAKER. My plan was to run it first with est2genome and protein2genome, using transcript and protein FASTA files as evidence, then use the resulting gene models to train a SNAP HMM, then continue on to iterative SNAP runs, Augustus, etc.

My problem comes when I check the output of the first SNAP run, which looks like this:

##gff-version 3
19__unscaffolded    .   contig  1   5673700 .   .   .   ID=19__unscaffolded;Name=19__unscaffolded
19__unscaffolded    snap    match   43  5639198 242017.148  +   .   ID=19__unscaffolded:hit:0:4.5.0.55;Name=snap-19__unscaffolded-abinit-gene-55.0-mRNA-1;target_length=5673700
19__unscaffolded    snap    match_part  43  89  16.744  +   .   ID=19__unscaffolded:hsp:0:4.5.0.55;Parent=19__unscaffolded:hit:0:4.5.0.55;Target=snap-19__unscaffolded-abinit-gene-55.0-mRNA-1 1 47 +;Gap=M47
19__unscaffolded    snap    match_part  172 221 20.284  +   .   ID=19__unscaffolded:hsp:1:4.5.0.55;Parent=19__unscaffolded:hit:0:4.5.0.55;Target=snap-19__unscaffolded-abinit-gene-55.0-mRNA-1 48 97 +;Gap=M50
19__unscaffolded    snap    match_part  293 361 21.730  +   .   ID=19__unscaffolded:hsp:2:4.5.0.55;Parent=19__unscaffolded:hit:0:4.5.0.55;Target=snap-19__unscaffolded-abinit-gene-55.0-mRNA-1 98 166 +;Gap=M69
19__unscaffolded    snap    match_part  405 530 36.538  +   .   ID=19__unscaffolded:hsp:3:4.5.0.55;Parent=19__unscaffolded:hit:0:4.5.0.55;Target=snap-19__unscaffolded-abinit-gene-55.0-mRNA-1 167 292 +;Gap=M126

... and on.

When I try to use maker2zff, it generates empty genome.ann and genome.dna files. When I use the GAAS package and its script gaas_maker_merge_outputs_from_datastore.pl, I get a GFF with no actual gene features (CDS, transcript, exon). Grepping for these also returns zero hits.

I thought this could be a problem with the SNAP installation/path in MAKER, so I installed SNAP separately through conda. It ran fine and generated exon predictions, but the "gff" output of SNAP is not actually GFF and cannot be read by maker2zff to generate a .hmm for further annotation.
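A quick way to confirm what a merged GFF actually contains is to tally the source and type columns rather than grepping for individual feature names. The sketch below is only an illustration: the example lines are shortened, tab-separated versions of the output quoted above (the attribute columns are abbreviated), and the set of feature types treated as "gene models" is an assumption.

```python
from collections import Counter

def count_feature_types(gff_lines):
    """Count (source, type) pairs in GFF3 lines, stopping at a ##FASTA section."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a well-formed feature line
        counts[(cols[1], cols[2])] += 1
    return counts

def has_gene_models(counts):
    """True if any gene-model feature type is present (assumed set of types)."""
    wanted = {"gene", "mRNA", "exon", "CDS"}
    return any(ftype in wanted for _, ftype in counts)

# Abbreviated example lines based on the output above (IDs shortened).
example = [
    "##gff-version 3",
    "19__unscaffolded\t.\tcontig\t1\t5673700\t.\t.\t.\tID=19__unscaffolded",
    "19__unscaffolded\tsnap\tmatch\t43\t5639198\t242017.148\t+\t.\tID=hit0",
    "19__unscaffolded\tsnap\tmatch_part\t43\t89\t16.744\t+\t.\tID=hsp0;Parent=hit0",
]
counts = count_feature_types(example)
for (source, ftype), n in sorted(counts.items()):
    print(source, ftype, n)
print("gene models present:", has_gene_models(counts))
```

For a real file, pass an open file handle instead of the list, for example `count_feature_types(open("maker_run.all.gff"))`.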

I have had this exact same issue on two different assemblies now from distinct phyla with distinct genomic architecture.

My question is: has anyone had issues like this, where SNAP in MAKER generates no gene models, or has anyone successfully run SNAP on its own and fed the output back into MAKER? How do I improve the output of a MAKER SNAP run?

Also, what are people's opinions on MAKER? I have been trying to get MAKER to run without near-constant troubleshooting for almost two years now, and a look at the forums shows me I'm far from the only one.

Cheers

Edit 1: Since posting this, I wondered whether my Singularity install of MAKER/SNAP could be causing the issue, so I ran SNAP again, this time from a MAKER conda install that I have had successful SNAP runs from in the past. This also failed to predict any models; the output contained only snap match and snap match_part features.

SNAP zff gff MAKER annotation • 1.4k views

ADD COMMENT • link 2.1 years ago by mrmrwinter ▴ 30

0

Maybe you could give MOSGA a chance for user-friendly genome annotation.

ADD REPLY • link 2.2 years ago by BioinformaticBird ▴ 110

0

I have had issues with running MOSGA in the past, and it is not as reproducible (being web-based), but I will give it a try to get past the SNAP stage.

ADD REPLY • link 2.2 years ago by mrmrwinter ▴ 30

score 1 · Answer 1 · 2022-03-08

1

2.2 years ago

Juke34 8.6k

Have a look here: https://agat.readthedocs.io/en/latest/topological-sorting-of-gff-features.html The file containing match/match_part is not the annotation; you should use the maker_annotation.gff file to train SNAP. You first have to append the FASTA sequence to that file properly.
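One reading of "adding the FASTA sequence properly" is the GFF3 convention of placing the genome after a single `##FASTA` directive at the end of the file. The following is a minimal sketch under that assumption; the function name and the toy inputs are made up for illustration and are not part of MAKER or GAAS.

```python
def append_fasta_section(gff_text, fasta_text):
    """Return GFF3 text with the sequence appended after one ##FASTA directive."""
    if any(line.strip() == "##FASTA" for line in gff_text.splitlines()):
        return gff_text  # sequence already embedded; do not append twice
    if not fasta_text.lstrip().startswith(">"):
        raise ValueError("fasta_text does not look like FASTA (no '>' header)")
    # Make sure each part ends with a newline so the directive sits on its own line.
    body = gff_text if gff_text.endswith("\n") else gff_text + "\n"
    fasta = fasta_text if fasta_text.endswith("\n") else fasta_text + "\n"
    return body + "##FASTA\n" + fasta

# Toy inputs, purely illustrative.
toy_gff = "##gff-version 3\nctg1\tmaker\tgene\t1\t4\t.\t+\t.\tID=g1"
toy_fasta = ">ctg1\nACGT"
combined = append_fasta_section(toy_gff, toy_fasta)
print(combined)
```

The sequence names in the FASTA headers have to match the first column of the GFF exactly, otherwise downstream tools will not be able to pair features with their sequence.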

"I get a gff with no actual genes"

If you only have this file it means you forgot to activate the annotation when running MAKER (in the maker_opts.ctl file).

ADD COMMENT • link 2.2 years ago by Juke34 8.6k

0

Hi Juke, thanks for your help.

I'm not sure I understand. The file containing match/match_part is the result of running MAKER with the following ctl settings.

#-----Genome (these are always required)
genome=/home/531734/mike/javanica/javanica_assemblies/hiCversion2.trimmed_headers.fasta.masked #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est= #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=  #protein sequence file in fasta format (i.e. from mutiple organisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org= #select a model organism for DFam masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=0 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

#-----Gene Prediction
snaphmm=/home/531734/mike/29_maker_hiCversion2/3.1_hiCversion2.maker.output/my_genome.hmm #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
snoscan_meth= #-O-methylation site fileto have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
allow_overlap=0 #allowed gene overlap fraction (value from 0 to 1, blank for default)

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=112 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=49000 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
min_intron=20 #minimum intron length (used for alignment polishing)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP=/home/531734/mike/29_maker_hiCversion2/tmp/ #specify a directory other than the system default temporary directory for temporary files

As far as I can tell, everything needed to perform SNAP annotation has been switched on. I have also checked that the FASTA sequence is at the end of the GFF I used to create the HMM for SNAP (maker_run.all.gff, the output of gff3_merge on the exonerate datastore).

My process so far, for two separate species, has been: (1) run MAKER with est2genome and protein2genome switched on, providing transcript and protein evidence; (2) extract the GFF using gff3_merge, generate genome.ann and genome.dna using fathom, validate and remove errors, then create an HMM using hmm-assembler.pl; (3) run MAKER with the above ctl file, using that HMM as input.

Please let me know if I'm doing something wrong. I have successfully run MAKER and SNAP in the past, and can't see what I'm doing differently from last time that is causing it to fail.
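For reference, the training step described above is usually written as a short chain of commands. The sketch below only builds and prints the command strings; it does not run anything. The command forms follow the commonly circulated MAKER and SNAP examples and should be checked against the installed versions. The index log name, the 1000 bp flank, and the `my_genome` base name are placeholders, not values confirmed by this thread.

```python
import shlex

def snap_training_commands(index_log, merged_gff="maker_run.all.gff", hmm_name="my_genome"):
    """Build (but do not run) a typical MAKER -> SNAP training command chain."""
    log = shlex.quote(index_log)
    gff = shlex.quote(merged_gff)
    name = shlex.quote(hmm_name)
    return [
        # Merge the per-contig GFFs from the MAKER datastore into one file.
        f"gff3_merge -s -d {log} > {gff}",
        # Convert MAKER gene models to ZFF; writes genome.ann and genome.dna.
        f"maker2zff {gff}",
        # Categorize the models and export the unique ones with flanking sequence.
        "fathom genome.ann genome.dna -categorize 1000",
        "fathom uni.ann uni.dna -export 1000 -plus",
        # Estimate parameters, then assemble them into a single HMM file.
        "forge export.ann export.dna",
        f"hmm-assembler.pl {name} . > {name}.hmm",
    ]

# Hypothetical index log name, used only to show the output.
commands = snap_training_commands("my_genome_master_datastore_index.log")
for command in commands:
    print(command)
```

If maker2zff leaves genome.ann and genome.dna empty, the merged GFF contains no gene models it can use, so the later steps have nothing to train on.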

Many thanks

Edit: I looked back through my lab notebooks from when I ran MAKER/SNAP successfully before, and those runs used the conda install of MAKER, not the Singularity container I am currently using. When I replicate this run using SNAP from conda, however, I get the same results. Running gaas_maker_merge_outputs_from_datastore.pl produces a folder with an empty maker_annotation_stats.txt file and no maker_annotation.gff file.

ADD REPLY • link 2.2 years ago by mrmrwinter ▴ 30

1

In this round of annotation you did activate SNAP, but:

you did not provide any evidence (protein or EST, in FASTA or GFF format)
you asked MAKER not to report annotations that lack evidence support (keep_preds=0)

Either provide evidence, or change keep_preds=0 to keep_preds=1.
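This combination is easy to miss in a long control file, so a small check can help. The sketch below parses `key=value #comment` lines in the style of maker_opts.ctl and flags the case described here: an ab initio predictor is configured, no evidence files are given, and keep_preds is 0. The function names and the lists of keys treated as "predictor" and "evidence" are assumptions for illustration; evidence passed through maker_gff is not considered.

```python
PREDICTOR_KEYS = ("snaphmm", "gmhmm", "augustus_species", "fgenesh_par_file")
EVIDENCE_KEYS = ("est", "altest", "est_gff", "altest_gff", "protein", "protein_gff")

def parse_ctl(text):
    """Parse 'key=value #comment' lines into a dict of stripped strings."""
    opts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and section headers
        if "=" in line:
            key, value = line.split("=", 1)
            opts[key.strip()] = value.strip()
    return opts

def predictions_will_be_dropped(opts):
    """True if predictors are set, no evidence is given, and keep_preds is 0."""
    has_predictor = any(opts.get(k) for k in PREDICTOR_KEYS)
    has_evidence = any(opts.get(k) for k in EVIDENCE_KEYS)
    keep_preds = float(opts.get("keep_preds") or 0)
    return has_predictor and not has_evidence and keep_preds == 0

# A few lines taken from the control file posted above.
sample = """\
#-----EST Evidence (for best results provide a file for at least one)
est= #set of ESTs or assembled mRNA-seq in fasta format
protein=  #protein sequence file in fasta format (i.e. from mutiple organisms)
snaphmm=/home/531734/mike/29_maker_hiCversion2/3.1_hiCversion2.maker.output/my_genome.hmm #SNAP HMM file
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)
"""
opts = parse_ctl(sample)
print(predictions_will_be_dropped(opts))
```

With keep_preds set to 1, or with any of the evidence keys filled in, the same check returns False.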

ADD REPLY • link 2.2 years ago by Juke34 8.6k

0

Ah brilliant. Thanks Juke, I'm getting models now.

ADD REPLY • link 2.1 years ago by mrmrwinter ▴ 30