SNAP training: how do you actually know when it has been trained enough and you can start your final MAKER run? I have tried running it several times, and the number of genes differs every time. It looks like a kind of sinusoidal graph: the number of genes goes up and down... So when do you stop? How do you know that SNAP is trained? Do you wait for a plateau? How many rounds of training did you do, and why?
My genome has unusually high repeat content, which is why I decided to build a custom repeat library for it with RepeatModeler. Where in the options file do I add this repeat library?
You can specify a custom repeat library (in FASTA format) with rmlib in the Repeat Masking section of the maker_opts.ctl file:
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib=repeatlibrary.fa #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/opt/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
This is an example Repeat Masking section
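If you would rather not edit the file by hand, here is a minimal sketch of setting rmlib from the shell. The function name set_rmlib is hypothetical (it is not part of MAKER), and it assumes the control file has an rmlib= line followed by a # comment, as in the example above. It keeps a .bak copy of the original file.

```shell
# Hypothetical helper, not part of MAKER: point rmlib at a repeat library.
# Usage: set_rmlib <control_file> <library_fasta>
# Assumes the library path contains no "|" or "&" characters.
set_rmlib() {
    ctl="$1"
    lib="$2"
    # Replace everything between "rmlib=" and the trailing "#" comment,
    # leaving the comment itself untouched. A backup is written to <ctl>.bak.
    sed -i.bak "s|^rmlib=[^#]*|rmlib=${lib} |" "$ctl"
    # Print the resulting line so you can check it.
    grep '^rmlib=' "$ctl"
}

# Example call (commented out so nothing is changed by accident):
# set_rmlib maker_opts.ctl repeatlibrary.fa
```

An absolute path to the library is the safer choice if you launch MAKER from a different directory than the one holding the FASTA file.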
You might also consider running ProtExcluder on the output of RepeatModeler:
# Run blastx then ProtExcluder to exclude known protein sequences from RepeatModeler library
/usr/bin/blastx -num_threads 75 -db /genetics/elbers/maker/uniprot_sprot.fasta -evalue 1e-6 \
-query repeatlibrary.fa -out repeatlibrary.fa.blast
/opt/ProtExcluder1.1/ProtExcluder.pl -f 50 repeatlibrary.fa.blast repeatlibrary.fa
# output of ProtExcluder is "temp"
# rename temp to whatever you desire
mv temp repeatlibrary.fa2
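As a quick sanity check, you could compare how many sequences are in the library before and after ProtExcluder. This is only a sketch: count_fasta_records is a hypothetical helper name, and it simply counts FASTA header lines (lines starting with ">").

```shell
# Hypothetical helper: count the records in a FASTA file by counting
# header lines, i.e. lines that start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example calls (commented out; file names follow the commands above):
# count_fasta_records repeatlibrary.fa
# count_fasta_records repeatlibrary.fa2
```

If the second number is much smaller than the first, it is worth looking at which repeat families were dropped before using the filtered library.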
Hi,
Here is the description of the repeat library construction; clicking the ProtExcluder link there gets you the tar.gz with the script. There is also a link to the manual. Good luck!
Look for the link in section 4, Exclusion of gene fragments.
Many thanks for your help, Jean,
I am following your advice and excluding the protein sequences. The question now is: which protein database did you use? Only UniProt, or combined with RefSeq? Isn't that redundant? Do you exclude the transposon sequences, as pointed out in the MAKER wiki? Do you do it by alignment to the transposon library? It sounds like a really simple step, but somehow I keep getting stuck...
The library that is provided in the manual is old (2011) and also appears to be corrupt...
I would use the most up-to-date Swiss-Prot database and not worry about combining RefSeq or transposon sequences. Someone with more experience might have better advice to give, but I think this is sufficient.
Got you. Thanks again!
Hi, I've been looking for ProtExcluder but couldn't find it. Could you please share the download link?