Question

BBduk log and stats appear to be inconsistent

2

Entering edit mode

12 months ago

Dave Carlson ★ 1.7k

Hi All,

I'm using BBduk to filter out reads where there is a kmer match to a specific set of contaminant sequences. Here is an example command:

bbduk.sh \
in1=${FQ1} \
in2=${FQ2} \
ref=${contaminant_fasta} \
k=21 \
hdist=1 \
stats=${OUTPUT}/${SAMPLE}_stats.txt \
out1=${OUTPUT}/${SAMPLE}_clean_R1.fastq \
out2=${OUTPUT}/${SAMPLE}_clean_R2.fastq \
outm=${OUTPUT}/${SAMPLE}_contaminant_match.fastq

However, I noticed that there appears to be a small inconsistency between the number of reads that match the contaminant fasta file in the job log vs the number that match the contaminant fasta file in the stats file.

Job log:

Input:                      15786272 reads      2234206006 bases.
Contaminants:               82980 reads (0.53%)     11953545 bases (0.54%)
Total Removed:              82980 reads (0.53%)     11953545 bases (0.54%)
Result:                     15703292 reads (99.47%)     2222252461 bases (99.46%)

Stats file:

#Total  15786272
#Matched    70789   0.44842%
#Name   Reads   ReadsPct

Obviously, the difference is not large, but I'm wondering why the read counts and percentages don't match exactly. Has anyone ever experienced this before?

bbduk bbmap • 748 views

ADD COMMENT • link updated 12 months ago by GenoMax 141k • written 12 months ago by Dave Carlson ★ 1.7k

1

Entering edit mode

Do you have more than one sequence in contaminant fasta? Can you write out refstats file? How many reads end up in the outm file (is it 82980 or 70789)?

ADD REPLY • link 12 months ago by GenoMax 141k

1

Entering edit mode

There is only a single sequence in the contaminant fasta. Here is the refstats output:

#Reads  15786272
#Mapped 70789
#References 1
#Name   Length  Scaffolds   Bases   Coverage    Reads   RPKM
DENV2-E-W_Min_CDX-420_clone_1H10_preMVS_vial168_IRR 733 1   10119041    13804.9673  70789   1364256.4515

The number of reads in the outm file is 82980--i.e., it matches the log, not the stats file.

The default behavior of bbduk is to remove both the forward and reverse reads if either one of them has a kmer match.

So one possibility could be that the stats file is reporting the total number of reads that had a kmer match, while the log file is reporting the total number of reads that were removed (including reads that did not have a kmer match but were removed because its mate did).

ADD REPLY • link 12 months ago by Dave Carlson ★ 1.7k

0

Entering edit mode

So one possibility could be that the stats file is reporting the total number of reads that had a kmer match, while the log file is reporting the total number of reads that were removed (including reads that did not have a kmer match but were removed because its mate did).

That is logical. Stats file gives reads that had a match and the output log gives the total number of reads removed.

ADD REPLY • link 12 months ago by GenoMax 141k