Question: Fewer and fewer genes predicted with each iteration of SNAP/MAKER
Asked 5 days ago by mrmrwinter (University of Hull):


I am annotating a de novo genome using MAKER.

I first ran MAKER with EST and protein evidence from a closely related species, with est2genome and protein2genome switched on.

I then ran MAKER with SNAP switched on, using the output of the previous step to train SNAP.

I then ran MAKER with SNAP switched on 3 more times.

Each time the number of predicted genes decreased. The first run predicted ~40,000 genes, the second ~12,000, the third ~1,200, and so on.

This can't be correct, surely? I am expecting around 10,000-20,000 genes for my organism.

Sorry, I am new to gene prediction and annotation. What I am asking is: why is SNAP drastically reducing the number of predicted genes with each iteration?

Should I just take the run whose gene count is nearest to what I expected and proceed to Augustus?
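To compare rounds, the gene counts quoted above can be tallied directly from each round's GFF3 output. Below is a minimal sketch, assuming standard tab-separated GFF3 with the feature type in column 3 and the source in column 2. The function name and the example records are hypothetical, not taken from the original post.

```python
from collections import Counter


def count_genes(gff_lines):
    """Count 'gene' features per source (column 2) in an iterable of GFF3 lines."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence data may follow this marker; stop reading features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9 and cols[2] == "gene":
            counts[cols[1]] += 1
    return counts


# Hypothetical records standing in for one round's merged GFF3.
example = [
    "##gff-version 3\n",
    "scaffold_1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\n",
    "scaffold_1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n",
    "scaffold_1\tmaker\tgene\t1500\t2400\t.\t-\t.\tID=gene2\n",
    "scaffold_1\tprotein2genome\tprotein_match\t120\t880\t.\t+\t.\tID=hit1\n",
]

print(count_genes(example))
```

With real data, the same function can be given an open file handle for each round (for example a hypothetical `round1.all.gff`), so the drop between iterations can be checked per source rather than from a single total.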


written 5 days ago by mrmrwinter

You should visualise all annotations along with the protein2genome and est2genome tracks in a genome browser, e.g. JBrowse. You will probably understand what is going on.

written 5 days ago by Juke-34

Unfortunately, sometimes you can "overtrain" your ab initio gene predictors. More information can be found by searching the MAKER developer Google group (maker-devel). I really don't have much experience with SNAP, only with Augustus. Training Augustus well is actually very difficult. Sometimes BUSCO does a better job at the initial training of Augustus, and retraining with MAKER-derived evidence actually makes the subsequent Augustus ab initio models worse. I don't know how to test the sensitivity and specificity of SNAP ab initio models, but I wrote about how to do this for Augustus ab initio models, and how I trained MAKER for a dromedary camel, in the "analysis-steps-for-manuscript.txt" available from the following Dryad repository:

modified 5 days ago • written 5 days ago by jean.elbers

I don't completely agree with you about training Augustus with BUSCO. I have tested this several times and always found that training Augustus with BUSCO, rather than with results from MAKER evidence-based annotation, is worse. I tested it with BUSCO3 and again recently with BUSCO4, thinking it might now give similar or even better results than the MAKER approach, but that is still not the case.

Starting from the MAKER evidence-based annotation, I wrote an explanation of the workflow to select the best gene models here: "gene set filter/selection for training ab initio annotation tools". We automated the workflow as a pipeline (recently converted from bpipe to nextflow) to train Augustus specifically (and SNAP, using the same selected gene models). You can find it here:

With BUSCO4 the difference between training Augustus within BUSCO and with MAKER is smaller, but in my opinion BUSCO training is still worse. Here is an example result on an insect:

annotation_type                  number_of_gene  number_of_mRNA  busco4_result(endopterygota_odb10)
abinitio augustus                21887           21887           C:84.6%[S:83.5%,D:1.1%],F:4.6%,M:10.8%,n:2124
abinitio augustus-busco-trained  20943           20943           C:81.8%[S:80.5%,D:1.3%],F:6.1%,M:12.1%,n:2124

For this result I ran MAKER using only proteins; when I also use species-specific transcriptomes, the Augustus training based on the MAKER result is even better.
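For anyone wanting to compare such runs programmatically, the BUSCO one-line summaries can be parsed into numbers. This is a minimal sketch that assumes the summary always has the `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` shape shown in the table; the function and variable names are hypothetical.

```python
import re

# Pattern for the BUSCO short summary string as it appears in the table above.
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary):
    """Return percentages (C, S, D, F, M) as floats and n as an int."""
    match = BUSCO_RE.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary string: {summary!r}")
    result = {key: float(value) for key, value in match.groupdict().items() if key != "n"}
    result["n"] = int(match.group("n"))
    return result


# The two summary strings from the table above.
augustus = parse_busco("C:84.6%[S:83.5%,D:1.1%],F:4.6%,M:10.8%,n:2124")
augustus_busco_trained = parse_busco("C:81.8%[S:80.5%,D:1.3%],F:6.1%,M:12.1%,n:2124")

# Difference in complete BUSCOs, in percentage points.
delta_complete = round(augustus["C"] - augustus_busco_trained["C"], 1)
print(delta_complete)
```

For the two rows above, the gap in complete BUSCOs is 2.8 percentage points, with correspondingly fewer fragmented and missing BUSCOs in the first row.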

modified 5 days ago • written 5 days ago by Juke-34

@Juke-34 Thank you for the links. You are probably doing a much better job than I have done at training Augustus with MAKER predictions. I have always found the opposite when comparing BUSCO and MAKER training of Augustus, but that is at least for mammals and one turtle (Kemp's Ridley sea turtle, marsh rice rat, garden warbler, different camel species). Training Augustus with BUSCO using the more comprehensive odb9 databases did not work well; I found the best results when training Augustus with BUSCO using eukaryota_odb9. I am basing "better" on the specificity and sensitivity results reported by Augustus with a command such as:

augustus --species=BUSCO_dromedary_eukaryota_odb9

The training sets come from the predictions of a MAKER run; see Step 25 of the above-mentioned analysis-steps-for-manuscript.txt for more details.
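As a reminder of what these two numbers mean, here is a minimal sketch of exon-level sensitivity and specificity. It assumes the usual gene-prediction convention, in which sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP), i.e. what other fields call precision, and it counts an exon as correct only if both boundaries match exactly. The function name and coordinates are hypothetical, and this is not Augustus's own code.

```python
def exon_level_accuracy(reference_exons, predicted_exons):
    """Return (sensitivity, specificity) for exact-match exons.

    Each exon is a (seqid, strand, start, end) tuple.
    """
    reference = set(reference_exons)
    predicted = set(predicted_exons)
    true_positives = len(reference & predicted)
    # Sensitivity: fraction of reference exons that were predicted exactly.
    sensitivity = true_positives / len(reference) if reference else 0.0
    # Specificity (gene-finding sense): fraction of predicted exons that are correct.
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical test set: four reference exons, three predictions, two exact matches.
reference = [
    ("scaffold_1", "+", 100, 250),
    ("scaffold_1", "+", 400, 520),
    ("scaffold_1", "+", 700, 900),
    ("scaffold_1", "-", 1500, 1800),
]
predicted = [
    ("scaffold_1", "+", 100, 250),
    ("scaffold_1", "+", 400, 530),  # wrong 3' boundary, so not counted as correct
    ("scaffold_1", "-", 1500, 1800),
]

sensitivity, specificity = exon_level_accuracy(reference, predicted)
print(sensitivity, specificity)
```

A predictor that calls very few genes can keep a high specificity while its sensitivity collapses, which is why both values need to be looked at together when comparing training sets.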

modified 5 days ago • written 5 days ago by jean.elbers

Interesting, thank you. Great work, by the way.
There are many different ways to select gene models for training purposes. It is true that what you do in the work you mention is quite light (filtering only by AED score, from what I understand), so I can understand that BUSCO training was better in this case. In our protocol we try to follow the recommendations of the Augustus group to select the best possible gene models, so there are a few more steps.

modified 5 days ago • written 5 days ago by Juke-34

Well, I had also tried many different things: some steps similar to your pipeline (e.g. redundancy removal) and other combinations (e.g. AED filtering, redundancy removal, and randomization). Still to no avail.
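For readers unfamiliar with these steps, here is a minimal sketch of AED filtering followed by randomization. It assumes the MAKER GFF3 carries the AED score on mRNA features as an `_AED=` attribute; the 0.25 cutoff, the function names, and the example records are illustrative assumptions only. Redundancy removal is not shown, since it needs a sequence-similarity search between the selected proteins.

```python
import random


def parse_attributes(column9):
    """Turn a GFF3 attribute column ('key=value;key=value') into a dict."""
    return dict(field.partition("=")[::2] for field in column9.split(";") if field)


def select_training_ids(gff_lines, max_aed=0.25, seed=1):
    """Return shuffled IDs of mRNA features whose _AED is at most max_aed."""
    selected = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "mRNA":
            continue
        attributes = parse_attributes(cols[8])
        if "_AED" in attributes and float(attributes["_AED"]) <= max_aed:
            selected.append(attributes.get("ID"))
    # Randomize the order so a later train/test split is not biased by genome position.
    random.Random(seed).shuffle(selected)
    return selected


# Hypothetical mRNA records with AED scores.
example = [
    "scaffold_1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1;_AED=0.05\n",
    "scaffold_1\tmaker\tmRNA\t1500\t2400\t.\t-\t.\tID=mrna2;Parent=gene2;_AED=0.40\n",
    "scaffold_2\tmaker\tmRNA\t300\t1200\t.\t+\t.\tID=mrna3;Parent=gene3;_AED=0.20\n",
]

training_ids = select_training_ids(example)
print(sorted(training_ids))
```

The shuffled list can then be sliced into training and test subsets; a fixed seed keeps the split reproducible between attempts.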

written 5 days ago by jean.elbers

OK, I had not seen the redundancy removal; this is one of the most important steps.

written 5 days ago by Juke-34

No, you were correct that I only showed AED-only filtering steps in the analysis steps, but there were also some attempts at combining AED filtering, redundancy removal, and randomization that I tried but didn't document.

written 5 days ago by jean.elbers

Does overtraining result in fewer predictions then?

Does this mean I should take the results of an earlier iteration, one nearer to what would be expected?

written 5 days ago by mrmrwinter

As mentioned by @jean.elbers, probably yes. I really advise you to visualise the results to make sense of them. You will probably see in your case that the SNAP predictions tend to merge loci.
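Alongside the visual check, merged loci can be flagged by looking for predicted genes that span two or more evidence-based genes on the same strand. This is a minimal sketch under that assumption; the function name and coordinates are hypothetical, and the simple nested loop is only meant for small examples.

```python
def find_merged_loci(evidence_genes, predicted_genes):
    """Map each predicted gene to the evidence genes it overlaps, if two or more.

    Each gene is a (seqid, strand, start, end) tuple with 1-based inclusive coordinates.
    """
    merged = {}
    for predicted in predicted_genes:
        seqid, strand, start, end = predicted
        overlapping = [
            evidence
            for evidence in evidence_genes
            if evidence[0] == seqid
            and evidence[1] == strand
            and evidence[2] <= end
            and start <= evidence[3]
        ]
        if len(overlapping) >= 2:
            merged[predicted] = overlapping
    return merged


# Hypothetical evidence-based genes (e.g. from est2genome/protein2genome).
evidence = [
    ("scaffold_1", "+", 100, 900),
    ("scaffold_1", "+", 1500, 2400),
    ("scaffold_1", "-", 3000, 3800),
]
# Hypothetical ab initio predictions; the first one spans two evidence genes.
snap = [
    ("scaffold_1", "+", 90, 2450),
    ("scaffold_1", "-", 2950, 3850),
]

merged = find_merged_loci(evidence, snap)
for predicted, hits in merged.items():
    print(predicted, "spans", len(hits), "evidence genes")
```

If a large share of predictions are flagged this way, a falling gene count between rounds is more likely to reflect merging than genuinely missing genes.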

written 5 days ago by Juke-34

I'm installing JBrowse as I type.

Thanks again Juke

written 5 days ago by mrmrwinter