Help with understanding BBduk's behavior
1
0
Entering edit mode
11 months ago
Dave Carlson ★ 1.7k

Hi All,

I'm trying to use BBduk to find and filter exact 18-mer matches in a contaminant fasta file from a set of input reads. BBduk reports that matches exist, however when I examine the reads that are supposed to include one or more kmers, I cannot find the matching kmer (or its reverse complement) in many of the reads.

So I'm trying to better understand what BBduk is doing in hopes that I can figure out why more results are not what I expect.

Here is a toy example that illustrates my confusion.

Input fasta (just a single string of 18 nucleotides for illustration purposes):

>kmer1
CTGGGAGACACCGCATGG

Input fastq file:

@NB551611:70:HGNJ5AFX5:1:11211:6538:6918 2:N:0:ACAGTG
AAAAGGATCTAGCATAGGGAAGATGTTCGAAGCAACCGCTAGGGGAGCTAGACGAATGGCAATACTGGGAGATACCGCATGGGACTTCGGATCGATAGGGGAGCTAGACGAATGGCAATACTGGGAGATACCGCATGGGACTTCGGAGATA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEAEEEEEEEE/<//////A//<AEEEA/<A/6<//A/////</AEA<E/E/<</A6AE/A66/A/E<EE/

My BBduk command:

bbduk.sh in=test.fq ref=one_kmer.fa hdist=0 k=18 stats=test.stats out=test.clean.fq kmask=lc

In this case, I have set hdist=0 because I think this will require exact kmer matches, and I have set kmask=lc to mask print the matching sequence in the output fastq file in lowercase so that it's easier to see what BBduk is considering a match.

Here is the output of the stats file confirming that BBduk found a match:

#File   test.fq
#Total  1
#Matched    1   100.00000%
#Name   Reads   ReadsPct
kmer1   1   100.00000%

And here is the masked fastq output:

@NB551611:70:HGNJ5AFX5:1:11211:6538:6918 2:N:0:ACAGTG
AAAAGGATCTAGCATAGGGAAGATGTTCGAAGCAACCGCTAGGGGAGCTAGACGAATGGCAATActgggagataccgcatggGACTTCGGATCGATAGGGGAGCTAGACGAATGGCAATActgggagataccgcatggGACTTCGGAGATA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEAEEEEEEEE/<//////A//<AEEEA/<A/6<//A/////</AEA<E/E/<</A6AE/A66/A/E<EE/

There are two reported matches in this read, but it's apparent to see that neither is actually an exact match to my input kmer. My input "reference" kmer has a "C" at the 9th position, while the reported matching kmers both have a "T" at that position.

So I don't understand why there is a reported match here. Does hdist=0 actually allow mismatches?

Does anybody have any insights?

Thanks! Dave

BBduk kmer • 614 views
ADD COMMENT
1
Entering edit mode
11 months ago
GenoMax 141k

My input "reference" kmer has a "C" at the 9th position, while the reported matching kmers both have a "T" at that position.

maskmiddle option is true by default as a result you see that base being "matched".

maskmiddle=t        (mm) Treat the middle base of a kmer as a wildcard

If you do the following then you will see that the T shows up.

I generally like to use a value for k that is less than half the size of the fragment you are trying to match. This ensures good initial matches for later extension. Here I am turning maskmiddle=f so you can see the mismatched T.

$ bbduk.sh -Xmx2g in=test.fq out=stdout.fq literal=CTGGGAGACACCGCATGG k=8 hdist=0 kmask=lc maskmiddle=f

@NB551611:70:HGNJ5AFX5:1:11211:6538:6918 2:N:0:ACAGTG
AAAAGGATCTAGCATAGGGAAGATGTTCGAAGCAACCGCTAGGGGAGCTAGACGAATGGCAATActgggagaTaccgcatggGACTTCGGATCGATAGGGGAGCTAGACGAATGGCAATActgggagaTaccgcatggGACTTCGGAGATA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEAEEEEEEEE/<//////A//<AEEEA/<A/6<//A/////</AEA<E/E/<</A6AE/A66/A/E<EE/

If I set hdist=1 then you no longer see that T.

$ bbduk.sh -Xmx2g in=test.fq out=stdout.fq literal=CTGGGAGACACCGCATGG k=8 hdist=1 kmask=lc maskmiddle=f

@NB551611:70:HGNJ5AFX5:1:11211:6538:6918 2:N:0:ACAGTG
AAAAGGATCTAGCATAGGGAAGATGTTCGAAGCAAccgctaggGGAGCTAGACGAATGGCAATActgggagataccgcatggGACTTCGGATCGATAGGGGAGCTAGACGAATGGCAATActgggagataccgcatggGACTTCGGAGATA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEAEEEEEEEE/<//////A//<AEEEA/<A/6<//A/////</AEA<E/E/<</A6AE/A66/A/E<EE/

If you use k=9 you will see that the match changes as below

$ bbduk.sh -Xmx2g in=test.fq out=stdout.fq literal=CTGGGAGACACCGCATGG k=9 hdist=0 kmask=lc maskmiddle=f

@NB551611:70:HGNJ5AFX5:1:11211:6538:6918 2:N:0:ACAGTG
AAAAGGATCTAGCATAGGGAAGATGTTCGAAGCAACCGCTAGGGGAGCTAGACGAATGGCAATACTGGGAGATaccgcatggGACTTCGGATCGATAGGGGAGCTAGACGAATGGCAATACTGGGAGATaccgcatggGACTTCGGAGATA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEAEEEEEEEE/<//////A//<AEEEA/<A/6<//A/////</AEA<E/E/<</A6AE/A66/A/E<EE/

Using a k of 10 or more does not find the match at all.

ADD COMMENT
0
Entering edit mode

Thanks a lot, GenoMax!

I had completely overlooked the maskmiddle default setting, and that was what was throwing me off. I appreciate the help!

ADD REPLY

Login before adding your answer.

Traffic: 1796 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6