Question

Exonerate Protein2Genome Options And Output

13.6 years ago

Darked89 4.6k

I ran exonerate 2.2.0 in client-server mode as follows:

nohup exonerate my_proteome.pep.frag001 localhost:12901 --model p2g --geneseed 250 --showtargetgff yes \
        --ryo ">%qi length=%ql alnlen=%qal\n>%ti length=%tl alnlen=%tal\n" \
        --showvulgar no --showalignment no 2> nohup.exonerate.my_proteome.pep.frag001 > exonerate_my_proteome.pep.frag001.out &

and in a more sensitive mode without the "--geneseed 250" option, then converted the output to GFF3 using the process_exonerate_gff3.pl script.
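The --ryo template above prints one ">" header for the query and one for the target per alignment, so the fraction of each protein that was aligned can be recovered from the output. A minimal Python sketch, assuming the headers appear as consecutive query/target pairs; the function names are made up for illustration, and the pattern also tolerates a missing space before "alnlen=":

```python
import re

# Matches header lines written by the --ryo template used above, e.g.
#   >prot1 length=300 alnlen=150
# "\s*" before "alnlen" also accepts a header where the space is missing.
RYO_LINE = re.compile(r"^>(\S+) length=(\d+)\s*alnlen=(\d+)\s*$")


def parse_ryo_pairs(lines):
    """Yield (query, target) dicts for each consecutive pair of --ryo headers."""
    pending = None
    for line in lines:
        match = RYO_LINE.match(line.strip())
        if not match:
            continue  # GFF lines, comments, and anything else are skipped
        record = {
            "id": match.group(1),
            "length": int(match.group(2)),
            "alnlen": int(match.group(3)),
        }
        if pending is None:
            pending = record          # first header of the pair: the query
        else:
            yield pending, record     # second header: the target
            pending = None


def query_coverage(query):
    """Fraction of the query protein covered by the alignment."""
    return query["alnlen"] / query["length"] if query["length"] else 0.0
```

Hits with low query coverage could then be dropped before any GFF conversion; where to put the cutoff is a judgment call.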

In both cases the result files are highly redundant (multiple similar proteins match the same genome fragment). Some matches are most likely artifacts (e.g. a protein match jumping over 100 kb full of other genes). Also, since neither the draft genome nor the protein library (e.g. A. thaliana) has been masked or cleaned of repetitive sequences, I sometimes get thousands of hits (one protein -> multiple genome segments). This last problem can be partially fixed (I have an incomplete DNA repeat library, and the A. thaliana proteins can be cleaned up based on their descriptions and an hmmer search with the pfam_07727 domain), but even after that a number of proteins (e.g. those with pentatricopeptide repeats) map almost everywhere.
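To see which proteins map "almost everywhere", the hits per query can be counted straight from the --showtargetgff output. A rough Python sketch, assuming each hit is reported as a tab-separated "gene" feature whose attribute column names the query as "sequence <id>" (check this against your own output); the function names and the max_hits cutoff are hypothetical:

```python
import re
from collections import Counter

# Assumption: the gene line's attribute column looks like
#   gene_id 1 ; sequence AT1G01010.1 ; gene_orientation +
SEQ_ATTR = re.compile(r"\bsequence (\S+)")


def hits_per_query(gff_lines):
    """Count 'gene' features per query protein in exonerate target GFF."""
    counts = Counter()
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # skip comments, non-GFF lines, and sub-features
        match = SEQ_ATTR.search(cols[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def promiscuous_queries(counts, max_hits=50):
    """Return query IDs with more than max_hits hits (cutoff is arbitrary)."""
    return sorted(q for q, n in counts.items() if n > max_hits)
```

Queries flagged this way are candidates for removal from the protein set, or for a separate run with stricter settings.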

Also, it seems that running without the --geneseed 250 option, which increases the running time sevenfold, generates more spurious repetitive matches.

Hence my questions:

What are the recommended ways of running exonerate in p2g mode?
How aggressively should the genome be masked (which RepeatMasker mode)?
Are there other PFAM domains that can be used to get rid of repeat proteins?
Is there a great advantage to the "--refine region" switch?
Last but not least: do you use any GFF/exonerate output "cleaners" to get rid of suspicious or simply redundant matches?

genome exonerate • 14k views

updated 7.5 years ago by Michael 54k • written 13.6 years ago by Darked89 4.6k

Ram · Answer 1 · 2010-10-18

I have found a pipeline called gpipe here.

Looking at the Makefiles I got:

--model p2g 
--forcegtag TRUE  
--bestn 200 
--maxintron 50000 
--proteinwordthreshold 3 
--proteinhspdropoff 5 
--proteinwordlen 5 
--forwardcoordinates FALSE 
--score 50

--maxintron is really useful for tandemly duplicated genes. Some combinations of the above may not work well when using exonerate-server; I am checking that right now.

Ram · Answer 2 · 2010-10-21

13.5 years ago

Darked89 4.6k

Here is another set of options:

exonerate --model protein2genome $file localhost:12887 \
--percent 70 \
--score 100 \
--showvulgar yes \
--softmaskquery no \
--softmasktarget yes \
--minintron 20 \
--maxintron 20000 \
--ryo ">%qi length=%ql alnlen=%qal\n>%ti length=%tl alnlen=%tal\n" \
--showalignment no \
--showtargetgff yes \
--geneseed 250

from this blog.

updated 4.6 years ago by Ram 43k • written 13.5 years ago by Darked89 4.6k

score 1 · Answer 3 · 2010-12-02

13.4 years ago

Giovanni M Dall'Olio 28k

Exonerate is good at aligning cDNAs and proteins to genomic sequences. This is why you find matches jumping over long distances: the program assumes that there is a 100,000 bp intron, which is not uncommon in nature. You can cap this with the --maxintron option, as in Darked89's answer.
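For output that has already been generated, gene models with suspiciously long introns can also be flagged after the fact. A small Python sketch, assuming the target GFF lists each "gene" feature followed by its own "intron" features (verify this ordering in your files); the function name is hypothetical and the 20000 bp default simply mirrors the --maxintron value quoted above:

```python
def long_intron_genes(gff_lines, max_intron=20000):
    """Return gene records whose longest intron exceeds max_intron bp.

    Assumes intron features directly follow the gene line they belong to.
    """
    genes = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip comments and non-GFF lines
        seqid, feature = cols[0], cols[2]
        start, end = int(cols[3]), int(cols[4])
        if feature == "gene":
            genes.append({"seqid": seqid, "start": start, "end": end,
                          "longest_intron": 0})
        elif feature == "intron" and genes:
            length = end - start + 1  # GFF coordinates are 1-based, inclusive
            if length > genes[-1]["longest_intron"]:
                genes[-1]["longest_intron"] = length
    return [g for g in genes if g["longest_intron"] > max_intron]
```

Flagged models are worth inspecting by eye rather than deleting outright, since some long introns are real.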

Moreover, if you think the results are redundant, use the -n 1 option (short for --bestn 1), which limits the results to one per query sequence. You can also restrict results with --score and --percent.
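Note that -n 1 works per query, so many similar proteins can still pile up on the same locus. One way to thin those out afterwards is a greedy pass that keeps only the best-scoring hit among overlapping ones. A hedged Python sketch, assuming tab-separated "gene" lines with a numeric score in column 6; the function name is made up, and only gene lines are returned (exons and other sub-features would need to be carried along separately):

```python
def best_nonoverlapping(gff_lines):
    """Greedily keep the highest-scoring gene hit per overlapping cluster.

    Hits are compared only when they share seqid and strand.
    """
    hits = []
    for line in gff_lines:
        text = line.rstrip("\n")
        cols = text.split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # (score, seqid, strand, start, end, original line)
        hits.append((float(cols[5]), cols[0], cols[6],
                     int(cols[3]), int(cols[4]), text))

    kept = []
    for hit in sorted(hits, key=lambda h: -h[0]):  # best score first
        _, seqid, strand, start, end, _ = hit
        overlaps = any(k[1] == seqid and k[2] == strand
                       and start <= k[4] and k[3] <= end
                       for k in kept)
        if not overlaps:
            kept.append(hit)

    # Report survivors in genome order
    return [h[5] for h in sorted(kept, key=lambda h: (h[1], h[3]))]
```

This is deliberately crude: a long, spurious high-scoring hit can suppress several good short ones, so it is best combined with an intron or span filter.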

What are you trying to do, exactly? I have never used exonerate to align protein sequences, so I can't be of much help.


I am using it for novel plant genome annotation. Plant protein data sets are either fishy (lots of repeats, bad predictions, etc.) or curated but limited, so one cannot expect to take the proteome of A and map it 1:1 to B. Moreover, there seem to be whole (I presume functional) protein families built from protein repeats (e.g. pentatricopeptide repeat), with hundreds of members in A. thaliana. Add to that tandemly repeated, nearly identical genes. While it is not a total mess (the results look mostly sensible), there are a lot of cases where exonerate gene models simply fail.

Reply 13.4 years ago by Darked89 4.6k

I understand; it must be a problem with the dataset. An alternative may be blat, which is also designed to align proteins and mRNA to a genome, but you will have to install it locally. The problem is that plants have a lot of duplications :-(

Reply 13.4 years ago by Giovanni M Dall'Olio 28k