Hi everyone, I'm starting with plant annotation and I'm using maker for the first time, however I have some doubts on how to configure my maker_opts.ctl file for my first round. I intend to do this first round with my previously masked genome (RepeatModeler and RepeatMasker), RNA-seq and proteins. My doubt is mainly in the masking section, I don't know what would be the correct way to set these parameters if I'm already providing a softmask genome:
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask= #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
When I set it as follows:
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/NFS/LUSTRE/storage/data/software/maker-2.31/maker/data/te_proteins.fasta #provide a fasta file of transposable eleme$rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=0 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
or softmask=1
I get the following error in all my scaffolds:
Processing run.log file...
MAKER WARNING: The file Plant_aura_var.Colimex.maker.output/Plant_aura_var.Colimex_datastore/B4/1C/scaffold_268//theVoid.scaffold_268/0/scaffold_268.0.all.rb.out
did not finish on the last run and must be erased
examining contents of the fasta file and run log
I would greatly appreciate your response.
Hi colindaven ,
Thanks for your reply, the problem I have is actually about how to configure my make_opts file for my first round. My plan is to run a first round with RNA-seq and proteins, a second with training with SNAP and Augustus and the previous gff and the third using the above. My doubts are about masking, is it better to use a softmask genome at the beginning and disable the masking options in maker? or maybe put a non-mask genome and a gff file of repetitions? Do you have any recommendations? I have many conflicts in the
#-----Repeat Masking
sectionThank you very much.
Hi san,
Have you got an answer? I would like to put a non-mask genome and a gff file of repetitions as you, will this option work well?
Hi, I ultimately decided not to use Maker, but I think it works as you described. Perhaps this protocol might help:
Please see the most up-to-date version of this protocol on my blog at https://darencard.net/blog/.
Genome Annotation using MAKER
MAKER is a great tool for annotating a reference genome using empirical and ab initio gene predictions. GMOD, the umbrella organization that includes MAKER, has some nice tutorials online for running MAKER. However, these were quite simplified examples and it took a bit of effort to wrap my head completely around everything. Here I will describe a de novo genome annotation for Boa constrictor in detail, so that there is a record and that it is easy to use this as a guide to annotate any genome.
Software & Data
Software prerequisites:
Raw data/resources:
Boa_constrictor_SGA_7C_scaffolds.fa
: The de novo Boa constrictor reference genome assembled as part of Assemblathon2. This is the SGA team assembly (snake 7C). This is in FASTA format.Boa_Trinity_15May2015.Txome_assembly.fasta
: A de novo transcriptome assembly created using Trinity and mRNAseq data from 10 Boa tissues. It may make sense to do some post-processing of this assembly, but I did not do so here. Trinity is pretty easy to run so I won't describe that here. This is in FASTA format.Running commands:
I've become accustomed to running most programs (especially those that take hours/days) using
screen
. I would create new screens for each command below. I also like to usetee
to keep track of run logs, as you will see in the commands below.Repeat Annotation
1. De Novo Repeat Identification
The first, and very important, step to genome annotation is identifying repetitive content. Existing libraries from Repbase or from internal efforts are great, but it is also important to identify repeats de novo from your reference genome using
RepeatModeler
. This is pretty easy to do and normally only takes a couple days using 8-12 cores.Further steps can be taken to annotate the resulting library, but the most important reason for this library is for downstream gene prediction. In this example, this Boa library was combined with several other snakes and annotated.
2. Full Repeat Annotation
Depending on the species, the de novo library can be fed right into MAKER. We normally do more complex repeat identification with snakes so I will describe that here.
First, we mask using a currated BovB/CR1 line library to overcome a previously-identified issue with the Repbase annotation. This probably won't be necessary in other species.
Then the maksed FASTA from this search can be used as input for the next search, using the
tetrapoda
library from Repbase. I also normally rename the outputs after each round so they are more representative of what they contain.And then I finished using two more rounds using a library of known and unknown snake repeats (including those from Boa). These rounds were split so that known elements would be preferentially annotated over unknown, to the degree possible.
Finally, results from each round must be analyzed together to produce the final repeat annotation
Finally, in order to feed these repeats into MAKER properly, we must separate out the complex repeats (more info on this below).
Now we have the prerequesite data for running MAKER.
3. Initial MAKER Analysis
MAKER is pretty easy to get going and relies on properly completed control files. These can be generated by issuing the command
maker -CTL
. The only control file we will be themaker_opts.ctl
file. In this first round, we will obviously providing the data files for the repeat annotation (rm_gff
), the transcriptome assembly (est
), and for the other Squamate protein sequences (protein
). We will also set themodel_org
to 'simple' so that only simple repeats are annotated (along withRepeatRunner
). Here is the full control file for reference.Specifying the GFF3 annotation file for the annotated complex repeats (
rm_gff
) has the effect of hard masking these repeats so that they do not confound our ability to identify coding genes. We let MAKER identify simple repeats internally, since it will soft mask these, allowing them to be available for gene annotation. This isn't a typical approach but has to be done if you want to do more than one succeessive round of RepeatMasker. I verified this would work with the MAKER maintainers here.Two other important settings are
est2genome
andprotein2genome
, which are set to 1 so that MAKER gene predictions are based on the aligned transcripts and proteins (the only form of evidence we currently have). I also construct the MAKER command in aBash
script so it is easy to run and keep track of.Then we run MAKER.
Given MAKER will be using BLAST to align transcripts and proteins to the genome, this will take at least a couple days with 12 cores. Speed is a product of the resources you allow (more cores == faster) and the assembly quality (smaller, less contiguous scaffolds == longer). We conclude by assembling together the GFF and FASTA outputs.
4. Training Gene Prediction Software
Besides mapping the empirical transcript and protein evidence to the reference genome and repeat annotation (not much of this in our example, given we've done so much up front), the most important product of this MAKER run is the gene models. These are what is used for training gene prediction software like
augustus
andsnap
.SNAP
SNAP is pretty quick and easy to train. Issuing the following commands will perform the training. It is best to put some thought into what kind of gene models you use from MAKER. In this case, we use models with an AED of 0.25 or better and a length of 50 or more amino acids, which helps get rid of junky models.
Augustus
Training Augustus is a more laborious process. Luckily, the recent release of
BUSCO
provides a nice pipeline for performing the training, while giving you an idea of how good your annotation already is. If you don't want to go this route, there are scripts provided with Augustus to perform the training. First, theParallel::ForkManager
module for Perl is required to runBUSCO
with more than one core. You can easily install it before the first time you useBUSCO
by runningsudo apt-get install libparallel-forkmanager-perl
.This probably isn't an ideal training environment, but appears to work well. First, we must put together training sequences using the gene models we created in our first run of MAKER. We do this by issuing the following command to excise the regions that contain mRNA annotations based on our initial MAKER run (with 1000bp on each side).
There are some important things to note based on this approach. First is that you will likely get warnings from BEDtools that certain coordinates could not be used to extract FASTA sequences. This is because the end coordinate of a transcript plus 1000 bp is beyond the total length of a given scaffold. This script does account for transcripts being within the beginning 1000bp of the scaffold, but there was no easy way to do the same with transcrpts within the last 1000bp of the scaffold. This is okay, however, as we still end up with sequences from thousands of gene models and BUSCO will only be searching for a small subset of genes itself.
While we've only provided sequences from regions likely to contain genes, we've totally eliminated any existing annotation data about the starts/stops of gene elements. Augustus would normally use this as part of the training process. However, BUSCO will essentially do a reannotation of these regions using BLAST and built-in HMMs for a set of conserved genes (hundreds to thousands). This has the effect of recreating some version of our gene models for these conserved genes. We then leverage the internal training that BUSCO can perform (the
--long
argument) to optimize the HMM search model to train Augustus and produce a trained HMM for MAKER. Here is the command we use to perform the Augustus training inside BUSCO.BUSCO.py -i Bcon_rnd2.all.maker.transcripts1000.fasta -o Bcon_rnd1_maker -l tetrapoda_odb9/ \ -m genome -c 8 --long -sp human -z --augustus_parameters='--progress=true'
In this case, we are using the Tetrapoda set of conserved genes (N = 3950 genes), so BUSCO will try to identify those gene using BLAST and an initial HMM model for each that comes stocked within BUSCO. We specify the
-m genome
option since we are giving BUSCO regions that include more than just transcripts. The initial HMM model we'll use is the human one (-sp human
), which is a reasonably close species. Finally, the--long
option tells BUSCO to use the initial gene models it creates to optimize the HMM settings of the raw human HMM, thus training it for our use on Boa. We can have this run in parallel on several cores, but it will still likely take days, so be patient.Once BUSCO is complete, it will give you an idea of how complete your annotation is (though be cautious, because we haven't filtered away known alternative transcripts that will be binned as duplicates). We need to do some post-processing of the HMM models to get them ready for MAKER. First, we'll rename the files within
run_Bcon_rnd1_maker/augustus_output/retraining_paramters
.We also need to rename the files cited within certain HMM configuration files.
Finally, we must copy these into the
$AUGUSTUS_CONFIG_PATH
species HMM location so they are accessible by Augustus and MAKER.5. MAKER With Ab Initio Gene Predictors
Now let's run a second round of MAKER, but this time we will have SNAP and Augustus run within MAKER to help create more sound gene models. MAKER will use the annotations from these two prediction programs when constructing its models. Before running, let's first recycle the mapping of empicial evidence we have from the first MAKER round, so we don't have to perform all the BLASTs, etc. again.
Then we will modify the previous control file, removing the FASTA sequences files to map and replacing them with the GFFs (
est_gff
,protein_gff
, andrm_gff
, respectively. We can also specify the path to the SNAP HMM and the species name for Augustus, so that these gene prediciton programs are run. We will also switchest2genome
andprotein2genome
to 0 so that gene predictions are based on the Augustus and SNAP gene models. Here is the full version of this control file.Then we can run MAKER, substituting this new control file, and summarize the output, as we did before.
6. Iteratively Running MAKER to Improve Annotation
One of the beauties of MAKER is that it can be run iteratively, using the gene models from the one round to train ab initio software to improve the inference of gene models in the next round. Essentially, all one has to do is repeat steps 4 and 5 to perform another round of annotation. The MAKER creators/maintainers recommend at least a couple rounds of ab initio software training and MAKER annotation (i.e., 3 rounds total) and returns start to diminish (at differing rates) thereafter. One needs to be careful not to overtrain Augustus and SNAP, so more rounds isn't necessarily always better. Below are a few ways of evaluating your gene models after successive rounds of MAKER to identify when you have sound models.
A. Count the number of gene models and the gene lengths after each round.
B. Visualize the AED distribution. AED ranges from 0 to 1 and quantifies the confidence in a gene model based on empirical evidence. Basically, the lower the AED, the better a gene model is likely to be. Ideally, 95% or more of the gene models will have an AED of 0.5 or better in the case of good assemblies. You can use this
AED_cdf_generator.pl
script to help with this.C. Run BUSCO one last time using the species Augustus HMM and take a look at the results (this will be quick since we are not training Augustus). Also, only include the transcript sequences (not the 1000 bp on each side) and be sure to take the best (i.e., longest) transcript for each gene so you aren't artificially seeding duplicates. You can also run it on the best protein sequence per gene instead. Your command will be some derivative of the following:
D. Visualize the gene models from Augustus, SNAP, and MAKER using a genome browser. JBrowse is a good option for this. You can essentially follow this guide to get this started. A helpful resource is this
gff2jbrowse.pl
script, which automates adding tracks to the browser based on the GFF output of your MAKER run. It is best to use 5-10 longer, gene dense scaffolds and visually inspect them. When SNAP and Augustus are well trained, their models should overlap pretty closely with the final MAKER models. Moreover, there will be spurious hits from SNAP and Augustus, but they are usually short, 1-2 exon annotations and don't have empirical support. You'll get a sense of a good annotation with some experience. Also, it is possible SNAP won't produce good results, depending on your organism, which the MAKER folks have pointed out in the past (Augustus usually does pretty well).7. Downstream Processing and Homology Inference
After running MAKER one now has protein models, but that isn't all together very useful. First, the MAKER default names are long, ugly, and likely difficult for programs to parse. Moreover, even if they were named "gene1", etc. that doesn't tell you anything about what the genes actually are. Therefore, it is necessary to do some downstream processing of the MAKER output and to use homology searches against existing databases to annotate more functional information about genes.
A. First, let's rename the IDs that MAKER sets by default for genes and transcripts. MAKER comes with some scripts to do just this and to swap them out in the GFF and FASTA output (instructions for generated are above). The commands below first create custom IDs and store them as a table, and then use that table to rename the original GFF and FASTA files (they are overwritten, but it is possible to regenerate the raw ones again).
I used BRAKER3 using my softmask genome as input. I hope it helps.