Question: BBDuk and qtrim parameter
0
gravatar for int11ap1
4.1 years ago by
int11ap1420
Barcelona
int11ap1420 wrote:

Hello,

qtrim is defined as

qtrim=f             Trim read ends to remove bases with quality below trimq.
                    Performed AFTER looking for kmers.
                    Values: 
                            rl (trim both ends), 
                            f (neither end), 
                            r (right end only), 
                            l (left end only),
                            w (sliding window).

What does it mean both ends, both ends of a read or both ends meaning forward and reverse read?

I'm asking this because using qtrim=r, launched with the command:

/Shared/002-Software/bin/trimming.py/libs/bbmap/bbduk.sh
-Xmx1g
in=sequentia20160330T145739/Arancia_S1_R1_001.fastq.gz
in2=sequentia20160330T145739/Arancia_S1_R2_001.fastq.gz
out=Trimmed_output_default/Trimmed_files/Arancia_S1_R1_001.trimmed.fq.gz
out2=Trimmed_output_default/Trimmed_files/Arancia_S1_R2_001.trimmed.fq.gz
minlength=35
ziplevel=1
trimq=25
hdist=1
threads=1
overwrite=true
mink=11
ref=/Shared/002-Software/bin/trimming.py/libs/adapters.fa
qtrim=r
k=31
ktrim=r
ftl=0
ftr=0
ftr2=0

gives the following results:

forward reverse

the image above is composed of reads with qualities like these:

6--A-<-<C-,,,6++@76+8,C,6C,6+6+8,,,,C,6CF,,C,FE,,,,,,,,+:6@++8CC,,669,:,+,:6>,,,=,4,7,,,,9<,,:,,4,,,9,4+4+A,A,C4,,,++9,,C=+@4@B+8+636,@,,,,,7,8,,,2,6,+14+++++0**0<****52+**3*;99*2;F*/8))/))01)4+)))*)+;A+2;>5)0)))22))0).)))))).)*)()()))*)*02**)).06))21959***2:*)))))0))/*/**11**))0/**1+3+2<C,88A;CE

or

CCCC9;C,,C8@A,EEC8,CF99,6,,,<66,CFF,;,,,;,CF,CFC,6CC,,6,,,6,`<CFFGE,;6,,,,BE@FGDE5F<,,,,,,,:AFEFG7,5?,,A,E;:,+,,,,44,+C,,,,,+6=@8@D,,9@,9,,,,,99@+6=;,,,6,,6,,7++,3662+=DF?++=++63++9++*;**;:0*3****0+1*03*3*3+*1=6?)/)*+++*++++++=AG;)+)*+1+1+;CD<AAEFF@>D

and using qtrim=rl, it gives the following ones:

forward reverse

bbduk trimming • 3.4k views
ADD COMMENTlink modified 4.1 years ago by Brian Bushnell17k • written 4.1 years ago by int11ap1420

both ends of a read. If you are using PE reads then both ends of both reads with rl option.

ADD REPLYlink written 4.1 years ago by genomax84k

So, when using paired-end reads with option qtrim=r, will it trim the 3' ends of both reads?

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by int11ap1420

If trimming is needed, yes.

ADD REPLYlink written 4.1 years ago by genomax84k

so I don't understand why I get some results... (see updated post)

ADD REPLYlink written 4.1 years ago by int11ap1420

Can you provide full bbduk command options you are using? Are you also scanning for adapters and trimming them?

ADD REPLYlink written 4.1 years ago by genomax84k

The command has been added to the post. And yes, I look for adapters (in fact, there's a high level of adapters at the 3' end of this dataset).

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by int11ap1420

@Brian recommends adding tbe and tpo options with normal paired-end reads. Can you add those and try a re-run? In the qtrim=r the trimq does not appear to be working since you have Q-scores below 25.
Trimming at 25 is very stringent. Generally 5 (or 10) is sufficient for re-sequencing data. Only if you are doing de novo assembly then you would want to be more stringent.

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by genomax84k

I repeated the trimming with tbe=t and tpo=t and it gives the very same results.

ADD REPLYlink written 4.1 years ago by int11ap1420

Are there any other things flagged by FastQC? Are you able to attach a full report? Have you looked at an uncompressed version of the FastQC graphs to see if there are specific cycles that have an issue? Does the per tile sequence quality plot show any uneven splotches of color?

ADD REPLYlink written 4.1 years ago by genomax84k

Tagging Brian Bushnell. He may have additional comments.

ADD REPLYlink written 4.1 years ago by genomax84k

I am assuming for each pair, the top graph is before quality trimming and the bottom graph is after quality trimming. Are they for read 1 or read 2? Or both?

Anyway, the results are strange and not what I would expect, so I am investigating and I'll get back to you. It looks like there is a special case when the entire read is extremely low quality except for the 3' end (which is a very unusual patter for Illumina reads) that is not being correctly handled by qtrim=r. However, without the "tbo" flag, you can get odd-looking results when adapter-trimming, particularly when most of the reads are short-insert. The reason is that adapter kmers will be found in the high-quality reads, and those reads will be trimmed. The adapter kmers in low-quality reads will have too many errors and thus not be found. So, if all of your reads have an insert size of around 200 bp, then the high-quality reads will be trimmed to 200 bp and the low-quality reads will not be adapter-trimmed (since they have too many errors to match the adapter sequence). So reads over 200bp will be highly enriched for low quality. Thus, it will look like the quality dropped in that region, but what actually happened is that the high-quality adapter bases were selectively trimmed. For this dataset, I strongly recommend adding the flags "tbo tpe" and using hdist=2 instead of 1. Also, it looks like you should use qtrim=rl until I address the special case where the entire read aside from the 3' end is low-quality.

ADD REPLYlink written 4.1 years ago by Brian Bushnell17k
3
gravatar for Brian Bushnell
4.1 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

OK, I've investigated... the odd trimming is not a bug or anything, just an odd aspect of your data; Illumina reads with max quality at the 3' end are very unusual. With trimq=25, using this set of qualities you posted:

6--A-<-<c-,,,6++@76+8,c,6c,6+6+8,,,,c,6cf,,c,fe,,,,,,,,+:6@++8cc,,669,:,+,:6>,,,=,4,7,,,,9<,,:,,4,,,9,4+4+A,A,C4,,,++9,,C=+@4@B+8+636,@,,,,,7,8,,,2,6,+14+++++00<*52+3;992;F/8))/))01)4+))))+;A+2;>5)0)))22))0).)))))).))()())))02)).06))219592:)))))0))/*/11))0/1+3+2<c,88a;ce< p="">

The longest region with average quality of at least Q25 is the last 4 bases, with qualities A;CE. All the rest should be trimmed. But that's a left-trim operation, as the ideal trimming would trim the first 293 bases (left-trim) and last 0 bases (right-trim). So with qtrim=r, nothing gets trimmed. That read will not show up in the output with qtrim=rl because the trimmed read is only 4bp long, below the minlength of 35.

So, you can either use qtrim=rl for this dataset, or filter out the low-quality reads with for example "maq=20" which will remove any read with an average quality below 20 after trimming. And I also recommend "tbo tbe" and potentially hdist=2 (if your average insert size is indeed shorter than 300bp) for this dataset.

It would be helpful if you posted the screen output of BBDuk; otherwise it's hard to say whether there is more adapter or quality trimming occurring.

ADD COMMENTlink written 4.1 years ago by Brian Bushnell17k
1

Ok, that was it. By applying a maq=20, quality gets fine. Thank you!!

ADD REPLYlink written 4.1 years ago by int11ap1420

@Brian: I wonder if these reads have been reverse complemented? That would explain the odd Q-score distribution. bcl2fasq v.2 can output RC reads.

ADD REPLYlink written 4.1 years ago by genomax84k

I doubt it; the qscore pattern of the first 6 bases is characteristic of an Illumina forward read. I think the weird patterns after trimming are being created by a small minority of reads; when looking at the graph, it's easy to assume the the number of samples at position 1 is the same as at position 270, but for all we know they could represent only a couple hundred reads on the far right side.

ADD REPLYlink written 4.1 years ago by Brian Bushnell17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1468 users visited in the last hour