Question: Masking redundancy/paralogy/repetitiveness in a genome assembly using a k-mer approach
gravatar for apelin20
5.6 years ago by
apelin20470 wrote:


As you may all know, most analysis on variation requires exclusion of paralogous regions which tend to inflate diversity.

Common approaches are to filter SNPs based on sequencing depth, in order to avoid repetitive regions.

Is there a way to mask with NNNNN all regions in an assembly which share k-mers with other regions and only keep regions present only once? I know BBDuk can mask regions from one fast file present in another fasta file, but what I am looking for is to work with one fasta file and mask repetitive k-mers.



paralogy bbmap ngs assembly k-mer • 2.4k views
ADD COMMENTlink modified 5.6 years ago by Brian Bushnell17k • written 5.6 years ago by apelin20470
gravatar for Brian Bushnell
5.6 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

Hi Adrian,

I actually have a couple of programs that can be used for that general purpose, depending on the specifics.

First off, there's BBMask [], which can mask low-entropy areas in a genome - for example, ATATATATATAT....etc.  You can adjust the entropy window size and entropy level; the default settings mask approximately 1% of the human genome, basically covering all of the areas that are low-complexity enough that human shares them exactly with plants and fungi.  BBMask can also accept a sam file mapped to a genome and mask everywhere the sam reads hit.  You could, for example, shred a genome into 100bp pieces, map them to itself, and make a sam file of only the multi-mapping reads, then mask everywhere they hit.

But if you want a kmer-based approach, you can use to generate a fasta file containing all kmers that exist at least 2 times in the genome, then mask those with BBDuk, like this: in=ref.fa out=kmers.fa mincount=2 k=31 in=ref.fa ref=kmers.fa out=masked.fa ktrim=N k=31 mm=f

Please note that for the purposes of calling variations, I highly recommend mapping to the unmasked genome.  You can then ignore variations occurring in regions that would have been masked...  but if you map to the masked genome, you can end up with a read that came from the masked portion (say, a gene with 2 identical copies) mapping to a homologous-but-not-identical region, causing false-positives.  Typically, if I am interested in calling high-quality variants, I throw away ambiguously-mapped (multi-mapped) reads rather than masking the genome.

ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by Brian Bushnell17k

I love your last approach:) Trying right now

ADD REPLYlink written 5.6 years ago by apelin20470

Odd, got an error:

@BioPower3-IBM ~/programs/bbmap $ sh in=/home/adrian/Encephalitozoon/Assembly/ECIII_Lemming_assembly_3months.fasta mincount=2 k=23 out=/home/adrian/Encephalitozoon/Assembly/ECIII_Lemming_assembly_masked.fasta
java -ea -Xmx128561m -Xms128561m -cp /home/adrian/programs/bbmap/current/ jgi.KmerCountExact in=/home/adrian/Encephalitozoon/Assembly/ECIII_Lemming_assembly_3months.fasta mincount=2 k=23 out=/home/adrian/Encephalitozoon/Assembly/ECIII_Lemming_assembly_masked.fasta
Executing jgi.KmerCountExact [in=/home/adrian/Encephalitozoon/Assembly/ECIII_Lemming_assembly_3months.fasta, mincount=2, k=23, out=/home/adrian/Encephalitozoon/Assembly/ECIII_Lemming_assembly_masked.fasta]

Initial size set to 63972057
Memory: max=129189m, total=129189m, free=128515m, used=674m

Initialization Time:          112.656 seconds.
Before Load:
Memory: max=129190m, total=129190m, free=19779m, used=109411m

Exception in thread "Thread-3" java.lang.AssertionError:
An input file appears to be misformatted.  The character with ASCII code 63 appeared where a base was expected.

[63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63.....

Maybe it's because scaffold %non-ACGTN=0.03

    at stream.Read.<init>(
    at stream.Read.<init>(
    at stream.FastaReadInputStream.generateRead(
    at stream.FastaReadInputStream.fillList(
    at stream.FastaReadInputStream.hasMore(
    at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(
    at stream.ConcurrentGenericReadInputStream$


ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by apelin20470

ok. I think it was a couple of "?" chars in my genome... No idea how they got there. Removed them and no error.

ADD REPLYlink written 5.6 years ago by apelin20470

I see what you mean by mapping to an unmasked genome. Indeed, mapping to a masked genome generates a lot of false positives.

Is there anyway to generate a .bed file with information as to what regions were masked? Then it becomes easy to filter variants called in these regions.

ADD REPLYlink written 5.6 years ago by Adrian Pelin2.4k

None of my programs generate bed files, but I will plan to add that capability.  Perhaps there is an existing tool somewhere that can create a bed file indicating the locations of all the Ns in a genome?

ADD REPLYlink written 5.6 years ago by Brian Bushnell17k

I was thinking about that too. If there are 20-50 Ns in a row, there might be no point is masking it, might just be a coincidence, but if we get regions of significant length masked (an I am not sure what those would be, maybe over 100bp?) then we can chose to exclude those. I have long looked for .bed generators of NNNN regions, they would be useful for several tools (Scaffolders do not produce .bed files and some tools request .bed locations of gaps in the assembly). I will look into this some more.

ADD REPLYlink written 5.6 years ago by Adrian Pelin2.4k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 681 users visited in the last hour