Fewer and fewer genes predicted with each iteration of SNAP/MAKER
2.2 years ago
mrmrwinter ▴ 30

Hi

I am annotating a de novo genome using MAKER.

I first ran MAKER with EST and protein evidence from a closely related species, with est2genome and protein2genome switched on.

I then ran MAKER with SNAP switched on, using the output of the previous step as input for SNAP.

I then ran MAKER with SNAP switched on three more times.

Each time, the number of predicted genes decreased: the first run predicted ~40,000, the second ~12,000, the third ~1,200...

This can't be correct, surely? I am expecting around 10,000-20,000 for my organism.

Sorry, I am new to gene prediction and annotation. What I am asking is: why is SNAP drastically reducing the number of predicted genes with each iteration?

Should I just take the number nearest to what I expected and proceed to Augustus?
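For anyone wanting to reproduce per-run gene counts like the ones quoted above, here is a minimal Python sketch. It assumes a standard nine-column GFF3 file and simply tallies `gene` features by their source column; the example records and IDs are made up for illustration and are not from the actual run.

```python
from collections import Counter

def count_genes(gff3_lines):
    """Count 'gene' features per source (column 2) in an iterable of GFF3 lines."""
    counts = Counter()
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "gene":
            counts[cols[1]] += 1
    return counts

# Hypothetical records, for illustration only.
example = [
    "##gff-version 3\n",
    "scaffold_1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\n",
    "scaffold_1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1\n",
    "scaffold_1\tmaker\tgene\t2000\t3500\t.\t-\t.\tID=g2\n",
    "##FASTA\n",
    ">scaffold_1\n",
]
gene_counts = count_genes(example)
print(gene_counts)
```

An open file handle can be passed in place of the list, so the same function could be run on the merged GFF3 from each round and the totals compared side by side.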

Thanks

maker annotation snap gene prediction genomics

You should visualise all annotations, along with the protein2genome and est2genome tracks, in a genome browser such as JBrowse. You will probably then understand what is going on.


Unfortunately, sometimes you can "overtrain" your ab initio gene predictors. More information can be found by searching the MAKER developer Google group: https://groups.google.com/forum/#!forum/maker-devel. I don't have much experience with SNAP, only with Augustus. Training Augustus well is actually very difficult. Sometimes BUSCO does a better job of the initial Augustus training, and retraining with MAKER-derived evidence actually makes the subsequent Augustus ab initio models worse. I don't know how to test the sensitivity and specificity of SNAP ab initio models, but I wrote about how to do this for Augustus ab initio models, and how I trained MAKER for a dromedary camel, in "analysis-steps-for-manuscript.txt", available from the following Dryad repository:


I don't completely agree with you about training Augustus with BUSCO. I have tested this several times and always found that training Augustus with BUSCO, rather than with results from a MAKER evidence-based annotation, gives worse results. I tested it with BUSCO3 and again recently with BUSCO4, thinking the results might now be similar to or even better than the MAKER approach, but that is still not the case.

I wrote an explanation of the workflow for selecting the best gene models from a MAKER evidence-based annotation here: gene set filter/selection for training ab initio annotation tools. We automated the workflow as a pipeline (recently converted from bpipe to Nextflow) to train Augustus specifically (and SNAP, using the same selected gene models). You can find it here: https://github.com/NBISweden/pipelines-nextflow.

The difference between training Augustus within BUSCO4 and with MAKER is smaller, but in my view BUSCO training is still worse. Here is an example result for an insect:

annotation_type                  number_of_gene  number_of_mRNA  busco4_result(endopterygota_odb10)
abinitio augustus                21887           21887           C:84.6%[S:83.5%,D:1.1%],F:4.6%,M:10.8%,n:2124
abinitio augustus-busco-trained  20943           20943           C:81.8%[S:80.5%,D:1.3%],F:6.1%,M:12.1%,n:2124


For this result I even ran MAKER using only proteins... When I use species-specific transcriptomes, Augustus training using the MAKER result is even better.
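To compare the two BUSCO summaries above programmatically rather than by eye, something like the following Python sketch could be used. The regular expression is my own assumption about the one-line summary layout, based only on the two strings in the table; the variable names are illustrative.

```python
import re

# Pattern for the one-line BUSCO summary, inferred from the strings quoted above.
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)

def parse_busco(summary):
    """Return the percentages (floats) and n (int) from a BUSCO summary line."""
    match = BUSCO_RE.fullmatch(summary.strip())
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {summary!r}")
    result = {key: float(value) for key, value in match.groupdict().items() if key != "n"}
    result["n"] = int(match.group("n"))
    return result

# Row "abinitio augustus" and row "abinitio augustus-busco-trained" from the table.
augustus = parse_busco("C:84.6%[S:83.5%,D:1.1%],F:4.6%,M:10.8%,n:2124")
augustus_busco = parse_busco("C:81.8%[S:80.5%,D:1.3%],F:6.1%,M:12.1%,n:2124")

# Difference in complete BUSCOs, in percentage points.
complete_diff = round(augustus["C"] - augustus_busco["C"], 1)
print(complete_diff)  # 2.8
```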


@Juke-34 Thank you for the links. You are probably doing a much better job than I have done of training Augustus with MAKER predictions. I have always found the opposite when comparing BUSCO and MAKER training of Augustus, but that is at least for the vertebrates I have worked with (Kemp's ridley sea turtle, marsh rice rat, garden warbler, different camel species). Training Augustus with BUSCO using the more comprehensive odb9 databases did not work well; I found the best results when training Augustus with BUSCO using eukaryota_odb9. I am judging "better" by the specificity and sensitivity reported by Augustus with a command such as:

augustus --species=BUSCO_dromedary_eukaryota_odb9 training.gb.test1


The training sets come from running the MAKER predictions through autoAug.pl; see Step 25 of the above-mentioned analysis-steps-for-manuscript.txt for more details.
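For readers unfamiliar with these two numbers: in the gene-prediction literature, sensitivity is usually TP / (TP + FN) and "specificity" is TP / (TP + FP), i.e. what other fields call precision. The Python sketch below illustrates that definition at the gene level only, counting a prediction as correct when its exon structure exactly matches an annotated gene. It is not the code Augustus runs, and the coordinates are invented.

```python
def sens_spec(predicted, annotated):
    """Gene-level sensitivity and specificity from exact structure matches.

    Each gene is a hashable tuple: (sequence, strand, ((start, end), ...)).
    """
    predicted, annotated = set(predicted), set(annotated)
    true_pos = len(predicted & annotated)
    sensitivity = true_pos / len(annotated) if annotated else 0.0
    specificity = true_pos / len(predicted) if predicted else 0.0
    return sensitivity, specificity

# Invented test-set genes, for illustration only.
annotated = {
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "-", ((900, 1200),)),
    ("chr2", "+", ((50, 500),)),
    ("chr2", "+", ((1000, 1500),)),
}
predicted = {
    ("chr1", "+", ((100, 200), (300, 400))),  # exact match
    ("chr1", "-", ((900, 1250),)),            # wrong end coordinate
    ("chr2", "+", ((50, 500),)),              # exact match
}
sn, sp = sens_spec(predicted, annotated)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Here two of four annotated genes are recovered and two of three predictions are correct. A predictor that merges or drops genes would be expected to lose sensitivity first, which is one reason to track both values across training rounds.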


Interesting, thank you. Great work, by the way.
There are many different ways to select gene models for training purposes. It is true that the selection in the work you mention is quite light (only by AED score, from what I understand). I understand that BUSCO training was better in this case. In our protocol we try to follow the recommendations of the Augustus group to select the best gene models possible, so there are a few more steps.


Well, I had also tried many different things: some steps similar to your pipeline (e.g. redundancy removal) and other things (e.g. AED filtering, redundancy removal, and randomization). Still to no avail.


OK, I had not seen the redundancy removal; this is one of the most important steps.


No, you were correct that I only showed AED-only filtering in the analysis steps, but I also tried combining AED filtering, redundancy removal, and randomization; I just didn't document those attempts.
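As a rough illustration of the AED filtering and randomization steps discussed here, a Python sketch follows. It assumes the AED value is stored as an `_AED=` attribute on mRNA lines in the MAKER GFF3; the 0.2 cutoff, test-set size, seed, and gene IDs are arbitrary placeholders, not values anyone in this thread recommended. Redundancy removal is not shown, since it needs a protein similarity search.

```python
import random
import re

def parse_aed(attributes):
    """Extract the _AED value from a GFF3 attribute column, or None if absent."""
    match = re.search(r"(?:^|;)_AED=([0-9.]+)", attributes)
    return float(match.group(1)) if match else None

def select_training_models(models, max_aed=0.2, n_test=2, seed=42):
    """Keep models with AED <= max_aed, shuffle them, and split into train/test."""
    kept = [model_id for model_id, aed in models if aed <= max_aed]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(kept)
    return kept[n_test:], kept[:n_test]

# Invented (ID, AED) pairs, for illustration only.
models = [("g1", 0.05), ("g2", 0.40), ("g3", 0.10),
          ("g4", 0.00), ("g5", 0.18), ("g6", 0.75)]
train, test = select_training_models(models)
print(len(train), len(test))
```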


What are your thoughts on using the BRAKER2 pipeline to train Augustus?

Secondly, I assume the MAKER evidence you use comes from RNA-supported MAKER models and not SNAP-based ones, since you use SNAP later on in your ab initio pipeline?


I am not sure how BRAKER2 with just proteins compares for training Augustus. In my experience with a dipteran fly, Augustus trained by BRAKER2 with Arthropoda OrthoDB v10 proteins and species-specific RNA-Seq reads was comparable to taking the BRAKER2 output, processing it with MAKER, and using that to train Augustus with the pipeline from https://github.com/NBISweden/pipelines-nextflow. I did not try running MAKER alone followed by the above-mentioned pipeline to compare against BRAKER2-trained Augustus.


I've had a similar problem. My de novo genome and transcriptome assemblies recover ~98% of BUSCOs, but my annotations are only retrieving ~2% after 3 rounds of MAKER. For the protein input, I tried both the proteome of a close relative and the UniProt/Swiss-Prot omnibus, and ended up sticking with the latter since it gave marginally better results.

I used both Augustus and SNAP training. For Augustus, I tried training with the initial MAKER round (and subsequent rounds), and when that didn't improve results I tried a predefined Augustus model based on a close model organism. But I am still only recovering < 3% of BUSCOs.

I'm looking into the Nextflow pipeline, but am having trouble installing all the dependencies (e.g. tcl, go, modules, singularity) and setting up a valid project path (I've tried various paths/structures). However, it seems like the Nextflow ab initio pipeline simply trains Augustus on a previous MAKER gff3, so is this fundamentally any different from what I've already tried?


For people who have this problem in the future: I seem to have resolved it by creating my own custom repeat library to use with MAKER. I'm still in the early rounds trained with Augustus, but recovery of BUSCOs has already increased to nearly 80%. To produce a de novo repeat library, I used the EDTA workflow, which combines several packages, including RepeatModeler and LTRharvest.
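In case it helps, a custom repeat library is normally passed to MAKER through the `rmlib` option in maker_opts.ctl. Below is a small Python sketch that sets an option in control-file text while keeping the trailing comment. The control-file excerpt is abridged from memory and the FASTA file name is a placeholder, so treat both as assumptions and check them against your own maker_opts.ctl.

```python
def set_ctl_option(ctl_text, key, value):
    """Set key=value in MAKER control-file text, preserving trailing comments."""
    updated, found = [], False
    for line in ctl_text.splitlines():
        setting, hash_sep, comment = line.partition("#")
        name, equals, _old_value = setting.partition("=")
        if equals and name.strip() == key:
            # Rebuild the line with the new value and the original comment.
            line = f"{key}={value} {hash_sep}{comment}".rstrip()
            found = True
        updated.append(line)
    if not found:
        raise KeyError(f"option {key!r} not found in control file")
    return "\n".join(updated) + "\n"

# Abridged excerpt; comment wording may differ between MAKER versions.
ctl = (
    "#-----Repeat Masking (leave values blank to skip repeat masking)\n"
    "model_org=all #select a model organism for RepBase masking in RepeatMasker\n"
    "rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker\n"
)

# "my_species.repeats.fa" is a hypothetical file name.
new_ctl = set_ctl_option(ctl, "rmlib", "my_species.repeats.fa")
print(new_ctl)
```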


Does overtraining result in fewer predictions, then?

Does this mean I should take the results of an earlier iteration, one nearer to what would be expected?


As mentioned by @jean.elbers, probably yes. I really advise you to visualise the results to make sense of them. In your case you will probably see that SNAP predictions tend to merge loci.
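Alongside the browser, merged loci can also be flagged with a quick script. The Python sketch below marks any prediction that overlaps two or more evidence-based genes on the same sequence and strand. The interval dictionaries are invented for illustration; in practice they would be parsed from the est2genome/protein2genome and SNAP features in the GFF3.

```python
def find_merged_loci(predictions, evidence_genes):
    """Return {prediction_id: [evidence_ids]} for predictions spanning >= 2 evidence genes.

    Both inputs map an ID to (sequence, strand, start, end), with inclusive coordinates.
    """
    merged = {}
    for pred_id, (p_seq, p_strand, p_start, p_end) in predictions.items():
        hits = [
            gene_id
            for gene_id, (seq, strand, start, end) in evidence_genes.items()
            if seq == p_seq and strand == p_strand and start <= p_end and p_start <= end
        ]
        if len(hits) >= 2:
            merged[pred_id] = sorted(hits)
    return merged

# Invented coordinates, for illustration only.
evidence_genes = {
    "e1": ("scaffold_1", "+", 100, 900),
    "e2": ("scaffold_1", "+", 1500, 2400),
    "e3": ("scaffold_1", "-", 3000, 4000),
}
predictions = {
    "p1": ("scaffold_1", "+", 150, 2300),   # spans e1 and e2
    "p2": ("scaffold_1", "-", 3050, 3900),  # matches e3 only
}
merged = find_merged_loci(predictions, evidence_genes)
print(merged)
```

The all-against-all comparison is fine for a quick check on one scaffold; for a whole genome an interval tree or a sorted sweep would be the usual choice.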


I'm installing JBrowse as I type.

Thanks again Juke