Question: Exonerate Protein2Genome Options And Output
8.6 years ago by
Barcelona, Spain
Darked89 wrote:

I ran exonerate 2.2.0 in client-server mode as follows:

nohup exonerate my_proteome.pep.frag001 localhost:12901 --model p2g --geneseed 250 --showtargetgff yes \
        --ryo ">%qi length=%ql alnlen=%qal\n>%ti length=%tl alnlen=%tal\n" \
        --showvulgar no --showalignment no 2> nohup.exonerate.my_proteome.pep.frag001 > exonerate_my_proteome.pep.frag001.out &

and in a more sensitive mode without the "--geneseed 250" option, then converted the output to gff3 using a script.

In both cases the result files are highly redundant (multiple matches of similar proteins to one genome fragment). Some matches are most likely artifacts (e.g. a protein match jumping over 100 kb full of other genes). Also, since neither the draft genome file nor the protein library (e.g. A. thaliana) has been masked or cleaned of repetitive sequences, I sometimes get thousands of hits (= one protein -> multiple genome segments). The last problem can be partially fixed (I have an incomplete DNA repeat library, and the A. thaliana proteins can be cleaned up based on descriptions and an hmmer search with the pfam_07727 domain), but even after that a number of proteins (e.g. those with pentatricopeptide repeats) map almost everywhere.
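One way to see which proteins are responsible for the bulk of the hits is to count gene features per query in the exonerate output. This is only a minimal sketch: it assumes the tab-separated GFF written by --showtargetgff yes, and that each "gene" line carries the query name as "sequence <id>" in its attribute column. The function name count_hits_per_query and the threshold are made up for illustration.

```shell
# count_hits_per_query MIN_HITS < exonerate_output
# Prints "query<TAB>count" for every query with at least MIN_HITS gene features,
# most frequent first. Lines that are not tab-separated GFF (e.g. --ryo lines)
# are ignored because their third field is empty.
count_hits_per_query() {
    awk -F'\t' -v min="$1" '
        $3 == "gene" {
            # assumed attribute layout: gene_id N ; sequence ID ; gene_orientation +
            n = split($9, parts, " ; ")
            for (i = 1; i <= n; i++) {
                if (parts[i] ~ /^sequence /) {
                    id = parts[i]
                    sub(/^sequence /, "", id)
                    count[id]++
                }
            }
        }
        END {
            for (id in count)
                if (count[id] >= min) print id "\t" count[id]
        }
    ' | sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# usage (hypothetical file name and threshold):
#   count_hits_per_query 100 < exonerate_my_proteome.pep.frag001.out
```

Queries near the top of that list are candidates for removal from the protein library or for separate, stricter handling.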

Also, it seems that the sevenfold increase in running time (running without the --geneseed 250 option) mainly generates more spurious repetitive matches.

Hence my questions:

  • what are the recommended ways of running exonerate in p2g mode?
  • how aggressively should the genome be masked (RepeatMasker mode)?
  • are there other PFAM domains that can be used to get rid of repeat proteins?
  • is there a great advantage to the "--refine region" switch?
  • last but not least: do you use any gff/exonerate output "cleaners" to get rid of suspicious or simply redundant matches?
8.5 years ago by
Barcelona, Spain
Darked89 wrote:

I have found a pipeline called gpipe here:

Looking at the Makefiles, I found these options:

--model p2g 
--forcegtag TRUE  
--bestn 200 
--maxintron 50000 
--proteinwordthreshold 3 
--proteinhspdropoff 5 
--proteinwordlen 5 
--forwardcoordinates FALSE 
--score 50

maxintron is really good to have for tandemly duplicated genes. Some combinations of the above may not work well when using exonerate-server. I am checking that right now.
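If --maxintron cannot be relied on at search time (for instance in server mode), overlong introns can still be spotted afterwards. A minimal sketch, assuming the tab-separated GFF from --showtargetgff yes contains "intron" feature lines; the name long_introns and the 50000 threshold are only illustrative, the latter chosen to match the gpipe value above.

```shell
# long_introns MAX_LEN < exonerate_output
# Prints every intron feature longer than MAX_LEN bp, with its length
# appended as an extra tab-separated column.
long_introns() {
    awk -F'\t' -v max="$1" '
        $3 == "intron" {
            len = $5 - $4 + 1          # GFF coordinates are 1-based, inclusive
            if (len > max) print $0 "\t" len
        }
    '
}

# usage (hypothetical file name):
#   long_introns 50000 < exonerate_my_proteome.pep.frag001.out
```

Gene models that contain any of the reported introns are the ones most likely to have fused two tandem copies.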

8.5 years ago by
Barcelona, Spain
Darked89 wrote:

Here is another set of options:

exonerate --model protein2genome $file localhost:12887 
--percent 70 
--score 100 
--showvulgar yes 
--softmaskquery no 
--softmasktarget yes 
--minintron 20 
--maxintron 20000 
--ryo ">%qi length=%ql alnlen=%qal\n>%ti length=%tl alnlen=%tal\n"
--showalignment no 
--showtargetgff yes 
--geneseed 250

from this blog:

8.4 years ago by
London, UK
Giovanni M Dall'Olio wrote:

Exonerate is good at aligning cDNAs and proteins to genomic sequences. This is why you find matches jumping over long distances: the program assumes that there is a 100,000 bp intron, which is not uncommon in nature. You can use the maxintron option as in Darked89's answer.

Moreover, if you think that the results are redundant, use the -n 1 option, which limits the results to one per query sequence. You can also restrict results by score and percent.
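For output that has already been generated, a similar reduction can be approximated after the fact. The sketch below keeps only the highest-scoring gene feature per query; it assumes tab-separated GFF from --showtargetgff yes, with the raw score in column 6 and "sequence <id>" in the attribute column of each "gene" line. The function name best_hit_per_query is made up, and this is not the same as what exonerate does internally with -n.

```shell
# best_hit_per_query < exonerate_output
# Prints one "gene" line per query: the one with the highest score (column 6).
# Output order is not defined; pipe through sort if a stable order is needed.
best_hit_per_query() {
    awk -F'\t' '
        $3 == "gene" {
            id = ""
            n = split($9, parts, " ; ")
            for (i = 1; i <= n; i++) {
                if (parts[i] ~ /^sequence /) {
                    id = parts[i]
                    sub(/^sequence /, "", id)
                }
            }
            if (id == "") next          # no query name found, skip the line
            if (!(id in best) || $6 + 0 > score[id]) {
                best[id] = $0
                score[id] = $6 + 0
            }
        }
        END { for (id in best) print best[id] }
    '
}

# usage (hypothetical file name):
#   best_hit_per_query < exonerate_my_proteome.pep.frag001.out
```

Note that keeping a single hit per query also discards genuine paralogs, so it suits a quick overview better than a final annotation.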

What are you trying to do, exactly? I have never used exonerate to align protein sequences, so I can't be of much help.


I am using it for novel plant genome annotation. Plant protein data sets are either fishy (lots of repeats, bad predictions etc.) or curated but limited, so one cannot expect to take the proteome of A and map it 1:1 to B. Moreover, there seem to be whole (I presume functional) protein families with protein repeats (e.g. pentatricopeptide repeat), with hundreds of members in A. thaliana. Add to that tandemly repeated, nearly identical genes. While it is not a total mess (the results look mostly sensible), there are a lot of cases where exonerate gene models simply fail.

written 8.4 years ago by Darked89

I understand, it must be a problem with the dataset. An alternative may be blat, which is also designed to align proteins and mRNA to a genome, but you will have to install it locally. The problem is that plants have a lot of duplications :-(

written 8.4 years ago by Giovanni M Dall'Olio