Question: Better strategies for repeat masking?
4.9 years ago, Philipp Bayer (6.9k) wrote:

I'm using MAKER to annotate all of my plant genomes on a single server, and the biggest bottleneck is RepeatMasker with RepBase. I'd say about 70% of the time is spent in RepeatMasker, with the rest going to AUGUSTUS/SNAP. That makes sense: RepeatMasker does a lot of BLAST-style searching, which takes very long on a single server.

I'm trying to think of ways to speed this up. So far my plan is to take my reference FASTA, split it up, run RepeatMasker on the pieces on a cluster (where I can't run MAKER at the moment), merge the masked FASTA files, and use the result in MAKER with RepeatMasker switched off. Are there any alternative, faster algorithms I could use for repeat masking, or any other ideas?
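The split-and-merge plan could be sketched roughly as follows. This is only an illustration, not a tested pipeline: the function names are made up for this example, and the greedy balancing (longest sequence first, always into the currently smallest chunk) is one assumption about how to keep the cluster jobs similar in size.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif line:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_records(records: Iterable[Record], n_chunks: int) -> List[List[Record]]:
    """Distribute records over at most n_chunks bins of similar total length."""
    bins: List[List[Record]] = [[] for _ in range(n_chunks)]
    sizes = [0] * n_chunks
    # Longest sequences first, each into the bin that is currently smallest.
    for rec in sorted(records, key=lambda r: len(r[1]), reverse=True):
        i = sizes.index(min(sizes))
        bins[i].append(rec)
        sizes[i] += len(rec[1])
    return [b for b in bins if b]  # drop bins that stayed empty


def format_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Render records as FASTA text, wrapping sequences at `width` columns."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

Each chunk would be written to its own file and masked as a separate job. Merging is then just reading the masked chunk files back with `read_fasta` and writing all records into one file with `format_fasta`. RepeatMasker normally writes its masked output next to the input with a `.masked` suffix, but it is worth checking that every chunk actually produced one before merging.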

Or should I even skip RepeatMasker in MAKER, and just filter the resulting transcripts with blastn against RepBase? Then at least the search space is much smaller (or maybe the noise from unmasked regions leads to a massive increase in resulting transcripts; I haven't tested this, has anyone?). I'm using already-trained versions of AUGUSTUS/SNAP, so hopefully these shouldn't be swayed too much by an increase in repeats...
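To make the post-filtering idea concrete, here is a minimal sketch that assumes the transcripts have already been searched against RepBase with blastn using tabular output (`-outfmt 6`, where columns 7 and 8 are the query start and end). The function names and the 0.5 coverage cutoff are hypothetical choices for illustration, not an established threshold.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def repeat_fraction(blast_lines: Iterable[str],
                    lengths: Dict[str, int]) -> Dict[str, float]:
    """Fraction of each transcript covered by at least one repeat hit."""
    hits: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qstart, qend = int(fields[6]), int(fields[7])
        hits[fields[0]].append((min(qstart, qend), max(qstart, qend)))

    fractions: Dict[str, float] = {}
    for qid, intervals in hits.items():
        intervals.sort()
        covered = 0
        cur_start, cur_end = intervals[0]
        # Merge overlapping or adjacent hits so no base is counted twice.
        for start, end in intervals[1:]:
            if start <= cur_end + 1:
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
        fractions[qid] = covered / lengths[qid]
    return fractions


def repeat_like(blast_lines: Iterable[str], lengths: Dict[str, int],
                min_fraction: float = 0.5) -> Set[str]:
    """IDs of transcripts whose repeat coverage reaches min_fraction."""
    fractions = repeat_fraction(blast_lines, lengths)
    return {qid for qid, frac in fractions.items() if frac >= min_fraction}
```

Transcripts returned by `repeat_like` would be candidates for removal. Whether this catches the extra predictions caused by unmasked repeats is exactly the open question above, so the cutoff would need checking against a genome that was also masked the normal way.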


RepeatMasker does have options for parallelisation (multithreading, -pa?). On top of that, you can split your genome and use different nodes plus multiple cores. Within RepeatMasker there is also a "sensitivity" option that may help you speed up the analyses, and there are alternative search engines to choose from. Some may be faster than others (I have no idea which; you could ask the RepeatMasker people). I hope this helps.
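As a rough sketch of how those options fit together, the helper below builds a RepeatMasker command line. The helper itself and its parameter names are hypothetical; the flags (`-pa`, `-s`/`-q`/`-qq`, `-engine`, `-species`, `-dir`) are the ones listed in RepeatMasker's help text as far as I know, so check `RepeatMasker -help` on your installed version before relying on them.

```python
from typing import List, Optional


def repeatmasker_command(fasta: str,
                         threads: int = 4,
                         speed: Optional[str] = None,
                         engine: Optional[str] = None,
                         species: Optional[str] = None,
                         out_dir: Optional[str] = None) -> List[str]:
    """Build an argument list suitable for subprocess.run()."""
    cmd = ["RepeatMasker", "-pa", str(threads)]  # -pa: parallel processes
    if speed is not None:
        # s = slow/more sensitive, q = quick, qq = rush (least sensitive)
        if speed not in {"s", "q", "qq"}:
            raise ValueError("speed must be one of 's', 'q', 'qq'")
        cmd.append("-" + speed)
    if engine is not None:
        cmd += ["-engine", engine]  # e.g. an installed alternative search engine
    if species is not None:
        cmd += ["-species", species]
    if out_dir is not None:
        cmd += ["-dir", out_dir]
    cmd.append(fasta)
    return cmd
```

For example, `repeatmasker_command("chunk_01.fa", threads=8, speed="q", engine="abblast")` gives an argument list that can be passed to `subprocess.run`. It only works if that engine was configured when RepeatMasker was installed.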

modified 2.4 years ago by _r_am (32k) • written 4.9 years ago by abascalfederico (1.1k)

Thank you for that - MAKER splits the genome and runs that over MPI, so I assume that part is taken care of.

I guess I'll have to fiddle with the sensitivity and get hold of an ab-blast binary, which should be faster!

modified 2.4 years ago by _r_am (32k) • written 4.9 years ago by Philipp Bayer (6.9k)