Question: First time using maker
Roxane Boyer (France / Toulouse / GeT-Plage) wrote, 2.7 years ago:

Hi everyone !

I'm trying to assemble the genome of D. suzukii, a species closely related to D. melanogaster but with a slightly bigger genome (D. mel: 150 Mb, D. suzukii: about 220 Mb). After the assembly, I want to use MAKER, but it's my first time using it.

I spent a lot of time reading the manual, which is really detailed: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial#Genome_Options_.28Required.29

I read that MAKER runs some ab initio predictors (like Augustus) and also uses external evidence (ESTs from the species or a related species, and the same for proteins). After that, it tries to build a kind of consensus to annotate the genome (if I'm right).

About the input for EST and protein evidence: the manual says that you can give a FASTA file or (and?) a GFF file, but I don't know if both are necessary (I don't think so, actually, because the small example in the tutorial only uses a FASTA file). Do we need to add a GFF, or is a FASTA alone enough? (Or maybe a GFF alone is enough, even without a FASTA?)

Also, since D. suzukii is not really an emerging model organism, but is closely related to D. melanogaster, do I need to do the training step for the ab initio gene predictors?

Thanks for your answers !

Cheers,

Roxane

Tags: maker, annotation, genome

Dear Roxane, I am facing the same problem you had with MAKER. I have already read the whole post, and although some of my doubts have been cleared up, I still have a couple of questions and was wondering if you could help me, please!

Ivan, Institute of Ecology, UNAM, Mexico

Written 6 months ago by imda

Hello imda !

It's been a while since I last used MAKER, but perhaps I can help you with that! What are your questions?

Cheers,

Roxane

Written 6 months ago by Roxane Boyer

Sorry for this very late response; I was trying to fix some bugs in my assembly. I have two questions.

If you do not have ESTs for your species, what did you do? The second one is about the running time of MAKER: I have seen that it is very, very slow! How can I speed up the annotation? Did you split your genome into small chunks?

Thank you very much

Written 4 months ago by imda

Hello Imda !

1) If you don't have any RNA-seq data from the species you want to annotate, you can still use protein evidence from several closely related species. But I wouldn't advise it: I think the best way to annotate a genome accurately is to use ESTs from the same species. Maybe it depends on what you need to do, though. MAKER will still work without ESTs and will try to make the best predictions from what it has (tools that predict gene structure, such as SNAP, and proteins from a closely related species).

2) And yes, MAKER can take a very long time. For me it was about 3-4 days per iteration (and you need at least 2 or 3 for the full MAKER pipeline, to train SNAP etc.). At some point I was thinking of launching MAKER contig by contig, but that would mean splitting the evidence as well... I'm not sure how this would impact the whole annotation process. Perhaps someone else has tried such a method? Maybe MAKER now has an option to run the process multithreaded on a cluster or something? Sadly, I really don't know :/

Cheers,

Roxane

Written 4 months ago by Roxane Boyer

Dear Roxane, I think I have fixed the problem by splitting the genome into many FASTA files. I used the tool that comes with MAKER called fasta_tool. This script splits the genome into many chunks; you can then run MAKER on each chunk and everything should run fine. You can apply this method if you do not have MPI.

Cheers

Written 3 months ago (modified 3 months ago) by imda
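To illustrate the chunking idea described above, here is a minimal Python sketch that splits a multi-FASTA assembly into smaller files. It is a stand-in for illustration only and does not reproduce MAKER's own fasta_tool; the function name `split_fasta`, the `records_per_chunk` parameter and the chunk file names are all hypothetical.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir, records_per_chunk=10):
    """Write the sequences of a multi-FASTA file into numbered chunk files.

    Returns the list of chunk paths that were written.
    (Hypothetical helper, not part of MAKER.)
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunk_paths = []
    handle = None
    n_records = 0
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # Open a new chunk file every `records_per_chunk` headers.
                if n_records % records_per_chunk == 0:
                    if handle is not None:
                        handle.close()
                    chunk_path = out_dir / f"chunk_{len(chunk_paths):04d}.fasta"
                    chunk_paths.append(chunk_path)
                    handle = open(chunk_path, "w")
                n_records += 1
            # Lines that appear before the first header are skipped.
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return chunk_paths
```

Each resulting chunk would then be given to its own MAKER run, as described in the reply above. Splitting by record count is only one possible choice; balancing chunks by total sequence length may spread the work more evenly when contig sizes vary a lot.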

Very nice to know! So how did it go in the end? Was the annotation good with the splitting process?

Written 3 months ago by Roxane Boyer

Yes, it all resulted in a good annotation, and it took about 7 days to finish a genome of 1.6 Gb.

Written 9 weeks ago by imda
SES (Vancouver, BC) wrote, 2.7 years ago:

"Do we need to add a GFF, or is a FASTA alone enough?"

You do not need both. Normally, people would input a set of assembled ESTs as FASTA for training. This step uses blastn and est2genome (from exonerate) for the alignments, and maker also does a lot of polishing/filtering of the alignments to produce the best transcript models. This whole process can be time consuming if you have a large genome. For this reason, you do not want to keep realigning ESTs and proteins redundantly, but instead provide them in a GFF after the initial training step.

However, you may provide FASTA/GFF of ESTs from a closely related species, in addition to your species, to give maker hints. That is a good reason to use both.

"Do I need to do the training step for the ab initio gene predictors?"

It really depends on what your goal is, but the short answer is that you do not have to train each program. For Drosophila species, you can tell Augustus (in the maker configuration) to use the Drosophila models. I would suggest this approach unless you are annotating a genome and are aiming to create gene models for this species. I say this because training Augustus is very time consuming and difficult. I would recommend training SNAP and running maker at least twice though, because training SNAP is fast and quite easy.

Written 2.7 years ago by SES

Hi SES, thanks for your answers !

So, maybe I wasn't very clear, but my goal is indeed to annotate the genome as accurately as possible.

As I understand from your answers, I will need:

For EST:
- a FASTA file of RNA evidence (D. suzukii, my main species), obtained by processing the RNA-seq FASTQ output with est2genome
- a GFF file corresponding to this RNA evidence
- a FASTA file of RNA evidence (D. melanogaster, my closely related species)
- a GFF file corresponding, again, to this evidence

For protein: sorry to bother you, but I'm not really sure what you meant by:

"This whole process can be time consuming if you have a large genome. For this reason, you do not want to keep realigning ESTs and proteins redundantly, but instead provide them in a GFF after the initial training step."

Does that mean that protein evidence + EST evidence is redundant?

I don't think I have protein evidence for D. suzukii. Will protein evidence from D. melanogaster be enough, since proteins "evolve slower" than RNA?

About gene prediction

I thought that we need to choose one particular program, so I have to choose between SNAP and Augustus, right? Can't both work at the same time?

Also, something isn't clear to me about the training process. I have read this manual section several times, but I can't say exactly how the training works. Do I have to run MAKER a first time, then train the ab initio predictors, and then re-run MAKER?

Thanks !

Written 2.7 years ago by Roxane Boyer

Using several ab initio tools can improve your results. So if you can train SNAP in addition to Augustus, that would be better (adding GeneMark-ET would be even better).

Aligning FASTA sequences is time consuming. SES meant that if you re-launch MAKER, you should not feed it the FASTA sequences again but rather the GFF produced by the previous run.

If you have proteins or transcripts from related species, add them to protein_gff= and altest_gff= respectively if they are in GFF format, or to protein= and altest= if they are in FASTA format.

If you have proteins from the investigated species, add them to protein_gff= or protein=, depending on whether it's a GFF or a FASTA file. If you have transcripts from the investigated species, add them to est_gff= or est=, depending on whether it's a GFF or a FASTA file.

/!\ Always separate different names/paths with a comma “,”.

For the repeats: model_org=fly
For the ab initio predictors: augustus_species=fly, and snaphmm= set to your HMM profile if you have one
And don't forget to set these parameters as follows: est2genome=0 protein2genome=0

A single run with the "fly" repeats, the hints (proteins + transcripts) and the ab initio predictors (at least Augustus) should be sufficient to produce a good annotation.

Written 2.7 years ago (modified 2.7 years ago) by Juke-34
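The settings listed above can be written out as control-file lines. The following Python sketch only formats `key=value` lines with comma-joined paths; the option names are the ones mentioned in the reply, while the function name `format_evidence_options` and every file path are hypothetical placeholders.

```python
def format_evidence_options(evidence):
    """Render 'key=value' lines, joining several paths with commas."""
    return [f"{key}={','.join(paths)}" for key, paths in evidence.items()]


# Hypothetical file names, for illustration only.
evidence = {
    "est": ["suzukii_transcripts.fasta"],
    "altest": ["melanogaster_transcripts.fasta"],
    "protein": ["melanogaster_proteins.fasta", "other_species_proteins.fasta"],
    "model_org": ["fly"],
    "augustus_species": ["fly"],
    "est2genome": ["0"],
    "protein2genome": ["0"],
}

for line in format_evidence_options(evidence):
    print(line)
```

The printed lines would be placed in maker_opts.ctl in place of the corresponding empty defaults; the point of the sketch is simply that several files for one option end up on a single line, separated by commas.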

Wow! That's a lot of information! Thanks!

I will need some time to fully understand; it's my first annotation project and I'm kind of confused.

What do you mean by:

"/!\ Always separate different names/paths with a comma “,”."

You mean, in my opt_file, like this?

arg1=/my/path/to/something , /my/path/to/an/other/something

So, when re-running MAKER in the same folder, I can reuse the information produced by the previous run and just change the parameters to do some further analysis?

So the ab initio training is done with est2genome=0 and protein2genome=0? And then, once it's trained, I set these options to 1, is that right?

Written 2.7 years ago by Roxane Boyer

About the comma, you're right.

When re-running MAKER, it might be better to work in a new folder (as you prefer; I usually use two different folders for the evidence-based annotation and the evidence-driven ab initio annotation. For other parameter changes, I keep the GFF result and relaunch MAKER in the same folder). Either way, you basically modify the options to remove the path to the FASTA file provided for the first run and give instead the path to the corresponding GFF produced by the previous run (the GFF of the transcripts/ESTs is called est2genome and the GFF from proteins is called protein2genome).

est2genome and protein2genome must be set to 0 when using abinitio tools. Otherwise the abinitio tools will not use the evidence to improve their gene models. To train an ab initio tool you need an annotation. So for non-model species without any HMM profile, you first have to do an annotation based only on the different lines of evidence. In that case you set est2genome and protein2genome to 1. As you already have an HMM profile for Augustus, you can perform an annotation directly using Augustus and use that annotation to train SNAP.

Written 2.7 years ago by Juke-34
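The FASTA-to-GFF swap between runs described above can be sketched in Python as follows. This is only an illustration of the bookkeeping, under the assumption that the options are held in a plain dictionary; the function name `options_for_rerun` and all file names are hypothetical.

```python
def options_for_rerun(first_run, previous_gff):
    """Return a copy of the options with FASTA evidence swapped for GFF.

    `previous_gff` maps a FASTA option (est, altest, protein) to the GFF
    file obtained from the previous run for that evidence type.
    """
    gff_key = {"est": "est_gff", "altest": "altest_gff", "protein": "protein_gff"}
    rerun = dict(first_run)
    for fasta_key, gff_path in previous_gff.items():
        rerun[fasta_key] = ""                 # stop realigning the FASTA file
        rerun[gff_key[fasta_key]] = gff_path  # reuse the earlier alignments
    return rerun


# Hypothetical file names, for illustration only.
first_run = {"est": "transcripts.fasta", "protein": "proteins.fasta",
             "est2genome": "1", "protein2genome": "1"}
second_run = options_for_rerun(
    first_run,
    {"est": "run1_est2genome.gff", "protein": "run1_protein2genome.gff"},
)
print(second_run)
```

Whether est2genome and protein2genome stay at 1 or go to 0 in the second run depends on whether an ab initio tool is being used, as explained in the reply above, so the sketch leaves those two values untouched.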

Okay, thanks for explaining the ab initio training. I think I'll have to read the MAKER tutorial again and again to understand it better; it's not clear to me yet. est2genome=0 means that you are using the ab initio tools, but setting it to 1 means that you are training them? I'm confused.

I have another question, about the EST evidence given as input. The EST evidence could be RNA-seq data processed into a transcriptome file. What I have in mind, though I'm not sure, is that MAKER takes a FASTA file as input, produced using TopHat and then Cufflinks, for example. The Cufflinks output is a GTF (or GFF, I don't remember), but there is a tool in Cufflinks that can convert a Cufflinks GTF output into a FASTA file. So I was thinking of giving these files as EST evidence (FASTA + GFF). But Cufflinks is sometimes not well optimised for gene-dense genomes like Drosophila genomes, so we might also want to use Trinity, for example, instead.

So, can I use different EST evidence like this:

est= my/cufflink/output.fasta, my/trinity/output.fasta
gff=my/cufflink/output.gff, my/trinity/output.gff

Could it help MAKER to use several kinds of EST evidence (obtained in different ways) when annotating, or not?

Written 2.7 years ago by Roxane Boyer

est2genome=0 means you don't create gene models based on ESTs. But if you have fed MAKER with ESTs and set it up to use an ab initio tool, the ab initio tool will use the ESTs to improve its accuracy (evidence-driven ab initio).

Setting it to 1 means you create gene models based on ESTs. And if you set it up to use an ab initio tool, the ab initio tool WILL NOT use the EST information to improve its accuracy (pure ab initio).

MAKER doesn't train an ab initio tool for you! It just lets you create an annotation that can be used to train your ab initio tool. The training process itself is something separate and depends on the tool you want to train.

There is no need to provide the same information several times; the redundancy is removed by MAKER. It's not a good approach to convert a GFF/GTF to FASTA format in order to align it through MAKER: you will lose some information during the alignment process compared with what you have in your GTF/GFF file. So:

est=my/trinity/output1.fasta,my/trinity/output2.fasta
est_gff=my/cufflink/output1.gff,my/cufflink/output2.gff

I guess you will have to convert the GTF from Cufflinks to GFF in order to be MAKER compliant.

"Could it help MAKER to use several kinds of EST evidence (obtained in different ways) when annotating, or not?"

Yes

Written 2.7 years ago by Juke-34

Hi Juke ! Thanks for your answers, it helped me a lot.

Just to be sure that I get it, est2genome=0 means that Maker will use separately the information contained in EST evidences and the ab initio predictions, but if set to 1, it means that, during the ab initio predictions, it's going to use EST evidences during the process on ab initio predictions.

But you advised me earlier that:

"est2genome and protein2genome must be set to 0 when using abinitio tools. Otherwise the abinitio tools will not use the evidence to improve their gene models"

Well, it seems to be the opposite of what I understood. Can you explain it to me one last time, please?

Written 2.7 years ago by Roxane Boyer

You're welcome, let's try again... I know it's not easy to get.

Let's imagine a case where we have just transcripts and an ab initio profile for Augustus.

Case 1) If you set est2genome=0 and you don't activate the ab initio tool, you will not get an annotation.

Case 2) If you set est2genome=1 and you don't activate the ab initio tool, you will get an annotation with gene models predicted only from the ESTs.

Case 3) If you set est2genome=1 and you activate the ab initio tool, you will get an annotation with gene models predicted from the ESTs and from pure ab initio prediction. If a gene model from the ESTs and one from the ab initio tool fall in the same locus, MAKER will choose only one of them to report in the final annotation.

Case 4) If you set est2genome=0 and you activate the ab initio tool, you will get an annotation with gene models predicted only by the ab initio tool. But in that case the ab initio prediction is evidence-driven: the EST alignments are passed to the ab initio tool to improve its accuracy.

Case 5) If you don't provide the ESTs and you activate only the ab initio tool, you will get a pure ab initio annotation (roughly the same as using the ab initio tool standalone).

In case 5 (it largely depends on the quality of the profile) you can roughly expect a sensitivity of 60%. You can expect a slightly better annotation in case 3. But in case 4 you should greatly improve the sensitivity and can expect something above 70%.

Written 2.7 years ago (modified 3 months ago) by Juke-34
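The five cases above can be summarised as a small decision function. This Python sketch is only a restatement of the reply for quick reference, not MAKER's internal logic; the function name `annotation_mode` and the wording of the returned labels are hypothetical, and the combination "no ESTs and no ab initio tool" is not discussed in the thread, so treating it as "no annotation" is an assumption.

```python
def annotation_mode(est_provided, est2genome, abinitio_active):
    """Map the settings discussed above to the kind of annotation produced."""
    if not est_provided:
        # Case 5, plus the assumed "nothing to work with" combination.
        return "pure ab initio" if abinitio_active else "no annotation"
    if est2genome and abinitio_active:
        return "EST-based models plus pure ab initio models"  # case 3
    if est2genome:
        return "EST-based models only"                        # case 2
    if abinitio_active:
        return "evidence-driven ab initio models"             # case 4
    return "no annotation"                                    # case 1


for settings in [(True, 0, False), (True, 1, False), (True, 1, True),
                 (True, 0, True), (False, 0, True)]:
    print(settings, "->", annotation_mode(*settings))
```

The same reasoning applies to protein2genome with protein evidence, which the thread treats in parallel with est2genome.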

Thanks a lot! I will refer to this post from now on to organise my MAKER runs. So, we have to train the ab initio software (like SNAP, because there is a parameter that asks for a SNAP HMM profile; now I know how to produce it myself), but does MAKER also launch the ab initio software itself?

Now I'm confused, because we can provide the result of the ab initio tool's training (SNAP), but that training has to be produced after launching MAKER several times (I spent a looong time reading the MAKER manual, and it seems to be a bootstrap-like step). This training produces a .hmm file that will be used by MAKER.

But does MAKER just use that HMM model as is? Or does it launch the ab initio tool itself?

Written 2.7 years ago by Roxane Boyer

MAKER will launch the ab initio tool only if the path to it is known (check maker_exe.ctl) and you provide an HMM profile in maker_opts.ctl. By the way, "profile" and "HMM model" are the same thing.

An Augustus profile already exists under the name "fly". If you plan to use SNAP too, you do have to train it yourself and set the snaphmm= parameter to the path of that profile.

MAKER doesn't really use the HMM model itself. It launches for you the ab initio tool you have requested, with the HMM model you specified.
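As a rough way to check the two conditions above before a run, the following Python sketch reads control-file text and reports which predictors have both an executable path and a profile. It assumes the control files use `key=value #comment` lines and that the executable entries are named snap and augustus; the function names and the example paths are hypothetical, and the check is a simplification of whatever MAKER does internally.

```python
def parse_ctl(text):
    """Parse 'key=value #comment' lines into a dict of stripped strings."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" in line:
            key, value = line.split("=", 1)
            options[key.strip()] = value.strip()
    return options


def active_predictors(exe_ctl_text, opts_ctl_text):
    """List predictors that have both an executable path and a profile."""
    exe = parse_ctl(exe_ctl_text)
    opts = parse_ctl(opts_ctl_text)
    active = []
    if exe.get("snap") and opts.get("snaphmm"):
        active.append("snap")
    if exe.get("augustus") and opts.get("augustus_species"):
        active.append("augustus")
    return active


# Hypothetical control-file contents, for illustration only.
exe_text = "snap=/opt/snap/snap #location of snap\naugustus=/opt/augustus/bin/augustus\n"
opts_text = "snaphmm= #no SNAP profile yet\naugustus_species=fly\n"
print(active_predictors(exe_text, opts_text))
```

With the example text, SNAP has an executable but no profile, so only Augustus would be reported, which matches the situation before the first SNAP training round.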

SNAP is usually trained in several steps.
First, you don't have any SNAP model, so you launch MAKER without SNAP. With this annotation you train SNAP and get your snap_hmm1.
Second, you re-launch MAKER, this time using SNAP with the snap_hmm1 profile. You then use the resulting annotation to train SNAP again and create a new HMM profile you can call snap_hmm2.
Third (and usually last), you re-launch MAKER once more, this time using SNAP with the snap_hmm2 profile. You then use the resulting annotation to train SNAP again and create a new HMM profile you can call snap_hmm3.
=> The training is finished. You have your nice HMM profile for SNAP, called snap_hmm3.

Written 2.7 years ago by Juke-34
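The iterative training described above follows a simple pattern: each MAKER run uses the profile trained in the previous round. The Python sketch below only lays out that schedule; it does not run MAKER or SNAP, and the function name `snap_training_plan` is hypothetical. The profile names follow the snap_hmm1, snap_hmm2, snap_hmm3 convention used in the reply.

```python
def snap_training_plan(rounds=3):
    """List (maker_run, snaphmm_used, hmm_trained) for each training round."""
    plan = []
    previous_hmm = None  # the first MAKER run has no SNAP profile yet
    for i in range(1, rounds + 1):
        trained = f"snap_hmm{i}"
        plan.append((i, previous_hmm, trained))
        previous_hmm = trained
    return plan


for run, used, trained in snap_training_plan():
    print(f"MAKER run {run}: snaphmm={used or ''} -> train SNAP -> {trained}")
```

The last profile in the plan (snap_hmm3 with the default three rounds) is the one to keep for the final annotation run.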

That's perfect! That was all I needed, a huge thanks! One last question: if it has the paths to both Augustus and SNAP, will it use both of them for the ab initio predictions and then use all kinds of evidence?

Written 2.7 years ago (modified 2.7 years ago) by Roxane Boyer

If you have activated SNAP and Augustus and have fed MAKER with lines of evidence (transcripts and proteins), it will predict gene models using evidence-driven Augustus and evidence-driven SNAP. At loci where both are present, it will choose the best one according to the lines of evidence (ESTs / proteins, when they are present).

Written 2.7 years ago by Juke-34

Thanks a lot! I think this whole discussion could have been posted as answers, because I now consider my question answered, and a lot of these details could be useful to many other beginners!

Written 2.7 years ago by Roxane Boyer