BBDuk quality filtering not producing expected results
1
0
Entering edit mode
2 days ago
rebeliscu • 0

I'm trying to trim/filter low quality reads from paired-end exome-seq data, using BBDuk.

I used the command:

for ea in $files; do R1="$ea"
R2=$(echo$R1 | sed "s/R1/R2/")
/home/shared/programs/bbmap/bbduk.sh -Xmx1g in1=$R1 in2=$R2 \
out1="$(echo$ea | sed s/.fastq.gz/_trimmed_filtered.fastq.gz/)" \
out2="$(echo$(echo \$ea | sed s/R1/R2/) | sed s/.fastq.gz/_trimmed_filtered.fastq.gz/)" \
t=10 ktrim=r k=23 kmin=11 hdist=1 maq=10 minlen=60 tpe tbo
done;


After running fastqc on the output of this, I'm seeing that R2 files have some reads with low quality scores (see per sequence quality score), and the overrepresented sequence "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN", which should have been filtered out by quality filtering, no?

Any help here would be much appreciated.

filtering trimming QC BBDuk quality • 230 views
2
Entering edit mode

Take a look at the quality associated with those reads? Perhaps it has other bases that are satisfying maq=10 criteria.

Are those pure N's? You may want to use this to filter reads with N's out.

maxns=-1            If non-negative, reads with more Ns than this
(after trimming) will be discarded.

0
Entering edit mode

Will try that, thanks.

Even so, though, shouldn't the plot for 'per sequence quality score' be zero for any quality score <10 based on the use of 'mapq=10'?

Edit: Looking at these sequences in the fastq:

@HISEQ:525:HMFYNBCXX:1:1101:1380:2167 2:N:0:CAGATC
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@HISEQ:525:HMFYNBCXX:1:1101:1276:2219 2:N:0:CAGATC
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@HISEQ:525:HMFYNBCXX:1:1101:1238:2328 2:N:0:CAGATC
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN


...Which would suggest these sequences have average quality scores = 0.

0
Entering edit mode
2 days ago
h.mon 33k

You are only discarding reads by average quality (maq=10), but you didn't set a quality cutoff trimming quality, such as trimq=10. Reads passing the average quality cutoff are kept whole, you need the trim parameter as well to trim low quality windows.

edit: the correct answer is the default value for performing quality trimming is set to false (qtrim=f), you should also set it to qtrim=rl, otherwise no quality trimming is performed, despite you setting maq=10.

0
Entering edit mode

Right, but again, those "N" reads should not have passed the average quality cut off, which is what is confusing to me.

0
Entering edit mode

Maybe their R1 pair are of good quality, and both pairs are then kept, to keep the R1 and R2 files synchronized? However, if I am not mistaken, this is not the default behaviour (as per default, removeifeitherbad=t), so it seems you hit a bug.

BBDuk does a lot of different things, and some operations have precedence over others. I know kmer matching is done before quality filtering. If the removeifeitherbad=t filter is applied after kmer matching but before quality trimming, then maybe both reads are kept, even if one is really bad.

Can you post the pairs of the reads above?

0
Entering edit mode

Here's (different) all "N" reads from read2 fastq, and corresponding reads in read1 fastq:

Read2: 1:@HISEQ:525:HMFYNBCXX:1:1101:1380:2167 2:N:0:CAGATC 2-NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 3-+ 4-####################################################################################################################### Read1: 1:@HISEQ:525:HMFYNBCXX:1:1101:1380:2167 1:N:0:CAGATC 2-GCAGCATAATCTTCAGGTGACAAAGTCTTGAGCCAGAGATCCAATTCCACTTTGTATTCCTTCTGCAGCAACTCAGCCTGGCTCTTGTAATGATCCTTCTG 3-+ 4-DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIHIIIHIIIIIIIHIIHIIIIII#

Read2: 5:@HISEQ:525:HMFYNBCXX:1:1101:1276:2219 2:N:0:CAGATC 6-NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 7-+ 8-####################################################################################################################### Read1: 5:@HISEQ:525:HMFYNBCXX:1:1101:1276:2219 1:N:0:CAGATC 6-GAATAAGAGAGGTAATGAGGGAAACAAGCCTATGGGGAGGGCCCAGATGGGAAGCAGGTGCTTCTGANTCTTGATGTTCAGGGTATGGGCTGGGTCAGCTG 7-+ 8-D@@@BGEHHCHHEHHFFIHIIEGHHIIHIFIIIIHHGH?EHHEHIE@@FEFHFEEHHIFEHC@CGH@#<<<C<F?@HHEHEHHEHEHIHHHIHHFEHEEC#

Read2: 2861:@HISEQ:525:HMFYNBCXX:1:1101:1238:2328 2:N:0:CAGATC 2862-NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 2863-+ 2864-################################################################################################################### Read1: 2861:@HISEQ:525:HMFYNBCXX:1:1101:1238:2328 1:N:0:CAGATC 2862-CAAATCTTCTCTAGCTTTGCTGAAAAAAAAATAGTAACTTTATAAATCCNTGCTAATTCTTTTTCGANAAATTATTTAGTCTTGAGTTCAAATCCAGTATT 2863-+ 2864-@0@@DIHHH11<CGH1<C@?GH11DHIHHC<FHFCEGH@FHHCE1C1<1#1<<DCFCGEFHCEFH<<#<<1<1<D@HCHHIIGI?EHEEEH<0GCFHEHGC

0
Entering edit mode

Why are you reads not starting with @something? That breaks the fastq format. Or those numbers at beginning are simply from some operation you did on the files?

You can see that read 1 in each case is good but read 2 is bad. I suggest that you simply set maxns=10 and that will filter out these read pairs.

Looks like you have some N in middle of the reads. Hope there are not too many of them.

0
Entering edit mode

Those numbers before the @ are the row number from grep.

I tried the maxns approach, and that seemed to do the trick. Thanks!