4.8 years ago by
Walnut Creek, USA
I actually have a couple of programs that can be used for that general purpose, depending on the specifics.
First off, there's BBMask [bbmask.sh], which can mask low-entropy areas in a genome, for example runs like ATATATATATAT. You can adjust the entropy window size and the entropy cutoff; the default settings mask approximately 1% of the human genome, basically covering all of the areas that are low-complexity enough that the human genome shares them exactly with plants and fungi. BBMask can also accept a SAM file of reads mapped to a genome and mask everywhere those reads hit. You could, for example, shred a genome into 100bp pieces, map them back to the genome, make a SAM file of only the multi-mapping reads, and then mask everywhere they hit.
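To make the entropy idea concrete, here is a toy sketch in bash/awk. It is not BBMask's algorithm: it assumes plain single-base composition in a sliding window, normalized so that 1.0 means all four bases are equally frequent, and it masks every window that falls below a cutoff. The function name `mask_low_entropy` and its arguments are made up for illustration.

```shell
#!/usr/bin/env bash
# Toy illustration only, not BBMask itself. Assumes one sequence string and
# uses single-base composition as a stand-in for a real entropy measure.
# mask_low_entropy WINDOW CUTOFF SEQUENCE  (hypothetical helper)
mask_low_entropy() {
    local window="$1" cutoff="$2" seq="$3"
    awk -v w="$window" -v c="$cutoff" -v s="$seq" 'BEGIN {
        n = length(s)
        for (i = 1; i + w - 1 <= n; i++) {
            split("", cnt)                          # reset base counts
            for (j = i; j < i + w; j++) cnt[substr(s, j, 1)]++
            h = 0                                   # Shannon entropy, scaled to 0..1
            for (b in cnt) { p = cnt[b] / w; h -= p * log(p) / log(4) }
            # Flag every base of a window whose entropy is under the cutoff.
            if (h < c) for (j = i; j < i + w; j++) low[j] = 1
        }
        out = ""
        for (i = 1; i <= n; i++) out = out ((i in low) ? "N" : substr(s, i, 1))
        print out
    }'
}

# An AT repeat flanked by more complex sequence; only the repeat is masked.
mask_low_entropy 8 0.6 "ACGTACGTATATATATATATACGTACGT"
```

A pure AT window scores 0.5 on this scale, so it falls under the 0.6 cutoff, while windows that mix in C or G stay above it.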
But if you want a kmer-based approach, you can use kmercountexact.sh to generate a fasta file containing all kmers that exist at least 2 times in the genome, then mask those with BBDuk, like this:
kmercountexact.sh in=ref.fa out=kmers.fa mincount=2 k=31
bbduk.sh in=ref.fa ref=kmers.fa out=masked.fa ktrim=N k=31 mm=f
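If it helps to see what the two commands above are doing conceptually, here is a small bash/awk sketch on a toy sequence. It is only an illustration under simplifying assumptions: a single sequence string, a short k, and no handling of reverse complements (which real k-mer tools may treat as equivalent). The name `mask_dup_kmers` is hypothetical and not part of BBTools.

```shell
#!/usr/bin/env bash
# Toy illustration only: mask every base covered by a k-mer that occurs at
# least twice in the sequence. Ignores reverse complements and FASTA parsing.
# mask_dup_kmers K SEQUENCE  (hypothetical helper)
mask_dup_kmers() {
    local k="$1" seq="$2"
    awk -v k="$k" -v s="$seq" 'BEGIN {
        n = length(s)
        # Pass 1: count every k-mer (the "mincount=2" selection step).
        for (i = 1; i + k - 1 <= n; i++) count[substr(s, i, k)]++
        # Pass 2: flag each base covered by a k-mer seen two or more times.
        for (i = 1; i + k - 1 <= n; i++)
            if (count[substr(s, i, k)] >= 2)
                for (j = i; j < i + k; j++) masked[j] = 1
        # Emit the sequence with flagged bases replaced by N.
        out = ""
        for (i = 1; i <= n; i++) out = out ((i in masked) ? "N" : substr(s, i, 1))
        print out
    }'
}

# ACGT occurs twice, so both copies are masked.
mask_dup_kmers 4 "ACGTTTGCACGTAAC"   # prints NNNNTTGCNNNNAAC
```

On a real genome you would use the commands above rather than anything like this; the sketch is just meant to show why every copy of a repeated k-mer ends up as N.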
Please note that for the purposes of calling variants, I highly recommend mapping to the unmasked genome. You can then ignore variants occurring in regions that would have been masked. If you map to the masked genome instead, a read that came from the masked portion (say, a gene with 2 identical copies) can end up mapping to a homologous-but-not-identical region, causing false positives. Typically, if I am interested in calling high-quality variants, I throw away ambiguously-mapped (multi-mapped) reads rather than masking the genome.
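As a rough sketch of throwing away ambiguous reads after the fact, the filter below drops SAM records under a mapping-quality cutoff. This leans on an assumption that depends on your aligner: that multi-mapped reads are reported with MAPQ 0 (or at least a very low MAPQ), so check how yours scores them before relying on it. Some aligners can also discard such reads at mapping time (BBMap's `ambig=toss`, for instance). The function name `drop_ambiguous` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: keep SAM header lines and alignments whose MAPQ (column 5) is
# at or above a cutoff. Assumes the aligner gives multi-mapped reads MAPQ 0.
# drop_ambiguous [MIN_MAPQ]  (hypothetical helper; reads SAM on stdin)
drop_ambiguous() {
    local min_mapq="${1:-1}"
    awk -v q="$min_mapq" 'BEGIN { FS = OFS = "\t" }
        /^@/ { print; next }    # pass header lines through unchanged
        $5 + 0 >= q             # print alignments at or above the cutoff
    '
}

# Toy input: r1 is uniquely mapped (MAPQ 40), r2 is ambiguous (MAPQ 0).
printf '@SQ\tSN:chr1\tLN:1000\nr1\t0\tchr1\t100\t40\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\nr2\t0\tchr1\t300\t0\t10M\t*\t0\t0\tTTTTTTTTTT\tIIIIIIIIII\n' \
    | drop_ambiguous 1
```

With the toy input, only the header and r1 survive; the variant caller then never sees the ambiguous read.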