Question: What Is The Best Search Engine To Use In Repeatmasker?
5.7 years ago by
Joseph Hughes2.8k
Scotland, UK
Joseph Hughes2.8k wrote:

RepeatMasker works with different search engines: abblast, rmblast, hmmer, cross_match. Have these search engines been benchmarked anywhere in terms of repeat detection, false-positive rate, speed, etc.?

I'll continue looking, but I have not found anything yet.

hmmer • 5.5k views
5.7 years ago by
Vancouver, BC
SES8.2k wrote:

It depends on what species you are trying to mask and what the end goal is. RepeatMasker can run for weeks, so it's important to decide up front exactly what you need. Do you need all repeats masked as precisely as possible, or do you just want a rough estimate of repetitiveness?

If you are masking a model species like human or Arabidopsis, just download the library of repeats for that species and mask with the fastest engine (probably cross_match, though rmblast is a close second, if not faster) without doing an exhaustive search. For those species, you can actually download pre-masked genomes. If you are masking a species closely related to one for which there is a library of repeats, you may want to take the same approach but with a more sensitive search.

If you are working with a non-model system, it becomes very difficult to mask using RepBase libraries because TEs evolve rapidly. For example, I have found that I can only mask 50% of the bases of sunflower TEs using RepBase because there are no closely related species in RepBase. This highlights the importance of having species-specific repeat libraries for masking. If that is not an option, use a more sensitive, signature-based method like nhmmer with models from a range of closely related species. That will give you the most sensitivity, but it will be a bit slower.
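To make the choices above concrete, here is a minimal Python sketch that assembles a RepeatMasker command line for a given scenario. The helper name `build_repeatmasker_cmd` and the file names are hypothetical. The options used (`-engine`, `-species`, `-lib`, `-s`, `-pa`) are the ones I know from the RepeatMasker command-line help, but check `RepeatMasker -help` for your installed version, since the accepted engine names can differ between releases.

```python
# Engine names as I understand RepeatMasker's -engine option to accept them
# (assumption: note "crossmatch" without the underscore; verify with -help).
ENGINES = {"crossmatch", "abblast", "rmblast", "hmmer"}


def build_repeatmasker_cmd(fasta, engine="rmblast", species=None, lib=None,
                           sensitive=False, threads=1):
    """Return a RepeatMasker argument list (hypothetical helper, not part of RepeatMasker)."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine}")
    # This sketch treats a named species library and a custom library as alternatives.
    if species and lib:
        raise ValueError("give either species or lib, not both")
    cmd = ["RepeatMasker", "-engine", engine, "-pa", str(threads)]
    if sensitive:
        cmd.append("-s")  # slower, more sensitive search
    if species:
        cmd += ["-species", species]
    if lib:
        cmd += ["-lib", lib]  # custom, e.g. species-specific, repeat library
    cmd.append(fasta)
    return cmd


# Model species: quick pass with a fast engine and the stock library.
print(build_repeatmasker_cmd("genome.fa", engine="rmblast", species="human"))
# Non-model species: custom library plus the sensitive setting.
print(build_repeatmasker_cmd("genome.fa", engine="rmblast",
                             lib="my_repeats.fa", sensitive=True))
```

The returned list can be passed to `subprocess.run` as-is; building a list rather than a string avoids shell-quoting problems with file names.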


Thanks, this is helpful, but I am really looking for a comparison between abblast, rmblast, hmmer and cross_match based on sensitivity, specificity and speed. This would enable me to choose the most appropriate search engine to use in RepeatMasker.

Joseph Hughes2.8k

I mentioned these things in my answer in general terms: hmmer is the most sensitive but will be slower than cross_match and the blast-based engines. The most important thing is to choose based on the task, such as distant comparisons vs. masking a model species with a species-specific library of repeats. No single engine is better than the rest for all tasks, which is why there are options. If you have a model species, I wouldn't worry about sensitivity because the engines will all produce similar results.
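If you want numbers for your own data, one option is to mask a test sequence with known repeat positions using each engine and score the output per base. The sketch below is only an illustration: the function names are hypothetical, the truth intervals are assumed to be 0-based and half-open, and masked bases are assumed to be lowercase (soft-masked output, `-xsmall`) or replaced by N or X. Any N already present in the unmasked input would be counted as masked, so it is best run on gap-free test sequences.

```python
def masked_positions(seq):
    """Return the set of 0-based positions that look masked (lowercase, N or X)."""
    return {i for i, base in enumerate(seq) if base.islower() or base in "NX"}


def per_base_metrics(masked_seq, true_repeat_intervals):
    """Per-base sensitivity and specificity of a masked sequence against known repeats.

    true_repeat_intervals: iterable of (start, end) pairs, 0-based, end exclusive
    (an assumed format for this sketch).
    """
    truth = set()
    for start, end in true_repeat_intervals:
        truth.update(range(start, end))
    predicted = masked_positions(masked_seq)

    tp = len(predicted & truth)   # repeat bases that were masked
    fp = len(predicted - truth)   # non-repeat bases that were masked
    fn = len(truth - predicted)   # repeat bases that were missed
    tn = len(masked_seq) - tp - fp - fn

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"sensitivity": sensitivity, "specificity": specificity}


# Toy example: bases 4-9 are a known repeat, but only 4-7 were masked.
print(per_base_metrics("ACGTacgtACGT", [(4, 10)]))
```

Speed can be compared separately by timing each RepeatMasker run on the same input and the same number of threads, for example with `time.perf_counter()` around the call.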

5.7 years ago by SES8.2k
5.5 years ago by
United States
timhowes0 wrote:

This is what it says on the RepeatMasker Web Server page:

Cross_match is slower but often more sensitive than the other engines. ABBlast (formerly known as WUBlast) is very fast with a slight cost of sensitivity. RMBlast is a RepeatMasker compatible version of the NCBI Blast tool suite. HMMER uses the new nhmmer program to search sequences against the new Dfam database (human only).

