I ran exonerate 2.2.0 in client-server mode as follows::
nohup exonerate my_proteome.pep.frag001 localhost:12901 --model p2g --geneseed 250 --showtargetgff yes \
    --ryo ">%qi length=%ql alnlen=%qal\n>%ti length=%tl alnlen=%tal\n" \
    --showvulgar no --showalignment no \
    2> nohup.exonerate.my_proteome.pep.frag001 > exonerate_my_proteome.pep.frag001.out &
I also ran it in a more sensitive mode, without the "--geneseed 250" option, and then converted the output to GFF3 using the process_exonerate_gff3.pl script.
In both cases the result files are highly redundant (multiple matches of similar proteins to one genome fragment). Some matches are most likely artifacts (e.g. a protein match jumping over 100 kb full of other genes). Also, since neither the draft genome file nor the protein library (A. thaliana) has been masked or cleaned of repetitive sequences, I sometimes get thousands of hits (= one protein -> multiple genome segments). The last problem can be partially fixed (I have an incomplete DNA repeat library, and the A. thaliana proteins can be cleaned up based on their descriptions and an hmmer search with the pfam_07727 domain), but even after that a number of proteins (e.g. those with pentatricopeptide repeats) still map almost everywhere.
Also, it seems that running without the "--geneseed 250" option, which increases the running time sevenfold, mostly generates more spurious repetitive matches.
Hence my questions:
- what are the recommended ways of running exonerate in p2g mode?
- how aggressively should the genome be masked (which RepeatMasker mode)?
- which other Pfam domains can be used to get rid of repeat-derived proteins?
- is there a big advantage to using the "--refine region" switch?
- last but not least: do you use any GFF/exonerate output "cleaners" to get rid of suspicious or simply redundant matches?
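
To make the last question concrete, here is a minimal Python sketch of the kind of "cleaner" meant above. It is only an illustration under stated assumptions, not a tested pipeline: it assumes the "--showtargetgff yes" output contains tab-separated, nine-column lines with a "gene" feature per match and an attribute field of the form "sequence <query id>", and the function names and both cutoffs (max_span, max_hits) are hypothetical. It flags matches that span more than a given distance on the target and query proteins that hit more often than a given count.

```python
import re
from collections import Counter


def parse_gene_lines(lines):
    """Yield (target, start, end, query) for each 'gene' feature line.

    Assumption: GFF lines are tab-separated with nine columns and the
    attribute column contains 'sequence <query id>'. Everything else in
    the exonerate output (comments, ryo lines, etc.) is skipped.
    """
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        match = re.search(r"sequence (\S+)", cols[8])
        query = match.group(1) if match else "unknown"
        yield cols[0], int(cols[3]), int(cols[4]), query


def flag_suspicious(lines, max_span=100_000, max_hits=50):
    """Return (long_spans, promiscuous_queries).

    long_spans: gene matches covering more than max_span bases on the target.
    promiscuous_queries: sorted query ids with more than max_hits gene matches.
    Both cutoffs are arbitrary placeholders to be tuned per genome.
    """
    genes = list(parse_gene_lines(lines))
    hits_per_query = Counter(query for _, _, _, query in genes)
    long_spans = [g for g in genes if g[2] - g[1] + 1 > max_span]
    promiscuous = sorted(q for q, n in hits_per_query.items() if n > max_hits)
    return long_spans, promiscuous
```

It could be run as, for example, `flag_suspicious(open("exonerate_my_proteome.pep.frag001.out"))`, after which the flagged matches can be inspected by hand or dropped before the GFF3 conversion.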