What Is The Best Search Engine To Use In Repeatmasker?
10.2 years ago
Joseph Hughes ★ 3.0k

RepeatMasker works with several search engines: abblast, rmblast, hmmer and cross_match. Have these engines been benchmarked anywhere in terms of repeat detection, false positives, speed, etc.?

I'll continue looking, but I haven't found anything yet.

hmmer • 10k views
10.2 years ago
SES 8.6k

It depends on what species you are trying to mask and what the end goal is. RepeatMasker can run for weeks, so it's important to decide up front exactly what you need. Do you need all repeats masked as precisely as possible, or do you just want a rough estimate of repetitiveness?

If you are masking a model species like human or Arabidopsis, just download the library of repeats for that species and mask with the fastest engine (probably cross_match, though rmblast is a close second, if not faster) without doing an exhaustive search. For those species, you can actually download pre-masked genomes. If you are masking a species closely related to one for which there is a library of repeats, you may want to take the same approach but with a more sensitive search.

If you are working with a non-model system, it becomes very difficult to mask using RepBase libraries because TEs evolve rapidly. For example, I have found that I can only mask 50% of the bases of sunflower TEs using RepBase because there are no closely related species in RepBase. This highlights the importance of having species-specific repeat libraries for masking. If that is not an option, use a more sensitive, signature-based method like nhmmer with models from a range of closely related species. That will give you the most sensitivity, but it will be a bit slower.
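To put a number on "how much got masked" (like the 50% figure above), one option is to count masked characters in the masked FASTA that RepeatMasker writes. This is only a rough sketch: the function name `masked_fraction` is made up for illustration, and it assumes masked bases appear as N, X or lowercase letters, so any Ns already present in the input assembly will be counted as masked too.

```python
def masked_fraction(fasta_path):
    """Return the fraction of sequence characters that look masked.

    Assumption: masked bases are written as N/X (hard masking) or as
    lowercase letters (soft masking). Ns that were already in the input
    are indistinguishable from masked bases and are counted as well.
    """
    masked = 0
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip FASTA header lines
            seq = line.strip()
            total += len(seq)
            masked += sum(
                1 for base in seq if base in "NnXx" or base.islower()
            )
    return masked / total if total else 0.0
```

Running this on the masked output from two different libraries (or two engines) for the same input gives a quick, if crude, way to compare how much each one masks.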


Thanks, this is helpful, but I am really looking for a comparison between abblast, rmblast, hmmer and cross_match based on sensitivity, specificity and speed. This would enable me to choose the most appropriate search engine to use in RepeatMasker.


I mentioned these things in my answer in general terms: hmmer is the most sensitive but will be slower than cross_match and the blast-based engines. The most important thing is to choose based on the task, such as distant comparisons vs. masking a model species with a species-specific library of repeats. No single engine is better than the rest for all tasks, which is why there are options. If you have a model species, I wouldn't worry about sensitivity because they will all produce similar results.
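If you want numbers for your own data, a small timing harness on a subset of the genome is one way to get them. The sketch below is only an illustration under some assumptions: the helper names `build_command` and `time_engines` are made up, every engine is assumed to be installed and configured for your RepeatMasker, and the flags used (`-engine`, `-pa`, `-species`, `-lib`, `-s`, `-dir`) should be checked against `RepeatMasker -help` for your version. Note also that the hmmer engine searches profile HMMs (Dfam), so a FASTA library passed with `-lib` may not be usable with it.

```python
import os
import shutil
import subprocess
import time

# Engine names as accepted by RepeatMasker's -engine option
# (assumption: verify against `RepeatMasker -help` on your install).
ENGINES = ["crossmatch", "abblast", "rmblast", "hmmer"]


def build_command(engine, fasta, species=None, lib=None,
                  sensitive=False, threads=1, outdir=None):
    """Build a RepeatMasker command line as a list of arguments."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine}")
    if (species is None) == (lib is None):
        raise ValueError("give exactly one of species or lib")
    cmd = ["RepeatMasker", "-engine", engine, "-pa", str(threads)]
    if species is not None:
        cmd += ["-species", species]   # use the bundled library for a species
    else:
        cmd += ["-lib", lib]           # use a custom repeat library
    if sensitive:
        cmd.append("-s")               # slower, more sensitive search
    if outdir is not None:
        cmd += ["-dir", outdir]        # keep each engine's output separate
    cmd.append(fasta)
    return cmd


def time_engines(fasta, **kwargs):
    """Run every engine on the same input and return wall-clock seconds."""
    if shutil.which("RepeatMasker") is None:
        raise RuntimeError("RepeatMasker not found on PATH")
    timings = {}
    for engine in ENGINES:
        outdir = f"rm_{engine}"
        os.makedirs(outdir, exist_ok=True)
        cmd = build_command(engine, fasta, outdir=outdir, **kwargs)
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        timings[engine] = time.perf_counter() - start
    return timings
```

For example, `time_engines("subset.fa", species="human", threads=4)` (with `subset.fa` as a placeholder file name) would run all four engines on the same input. Timing only covers speed; to compare sensitivity you would still need to look at what each run actually masked.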

10.0 years ago
timhowes ▴ 10

This is what it says on the RepeatMasker Web Server page:

Cross_match is slower but often more sensitive than the other engines. ABBlast (formerly known as WUBlast) is very fast with a slight cost of sensitivity. RMBlast is a RepeatMasker compatible version of the NCBI Blast tool suite. HMMER uses the new nhmmer program to search sequences against the new Dfam database (human only).
