Question: Differences between trimming tools: cutadapt, bbduk and sickle
3
gravatar for rioualen
19 months ago by
rioualen310
France
rioualen310 wrote:

Hello,

I'm starting to use cutadapt instead of sickle, and I'm not sure I understand how it works exactly. I tried to use minimum length of 20 bp and minimum quality score of 20, and the results are super different between the three programs.

This is cutadapt output:

cutadapt -q 20 -m 20 -o output_q20_m20.fastq input.fastq
This is cutadapt 1.13 with Python 2.7.9
Command line parameters: -q 20 -m 20 -o output_q20_m20.fastq input.fastq
Trimming 0 adapters with at most 10.0% errors in single-end mode ...
Finished in 12.03 s (3 us/read; 17.89 M reads/minute).

=== Summary ===

Total reads processed:               3,587,045
Reads with adapters:                         0 (0.0%)
Reads that were too short:             571,747 (15.9%)
Reads written (passing filters):     3,015,298 (84.1%)

Total basepairs processed:   125,546,575 bp
Quality-trimmed:              21,189,744 bp (16.9%)
Total written (filtered):     98,543,830 bp (78.5%)

This is sickle output:

sickle se --fastq-file input.fastq --qual-type sanger --qual-threshold 20 --length-threshold 20 --output-file output_sickle_q20_m20.fastq

SE input file: input.fastq

Total FastQ records: 3587045
FastQ records kept: 2449731
FastQ records discarded: 1137314

This is BBDuk output:

./bbduk.sh -Xmx1g in=input.fastq out=output_bbduk.fq qtrim=r trimq=20 ml=20 overwrite=true
java -Djava.library.path=/home/rioualen/Desktop/bbmap/jni/ -ea -Xmx1g -Xms1g -cp /home/rioualen/Desktop/bbmap/current/ jgi.BBDukF -Xmx1g in=input.fastq out=output_bbduk.fq qtrim=r trimq=20 ml=20 overwrite=true
Executing jgi.BBDukF [-Xmx1g, in=input.fastq, out=output_bbduk.fq, qtrim=r, trimq=20, ml=20, overwrite=true]

BBDuk version 37.10
Initial:
Memory: max=1029m, free=993m, used=36m

Input is being processed as unpaired
Started output streams: 0.020 seconds.
Processing time:        2.012 seconds.

Input:                      3587045 reads       125546575 bases.
QTrimmed:                   3156968 reads (88.01%)  68828955 bases (54.82%)
Total Removed:              1592853 reads (44.41%)  68828955 bases (54.82%)
Result:                     1994192 reads (55.59%)  56717620 bases (45.18%)

Time:               2.059 seconds.
Reads Processed:       3587k    1742.23k reads/sec
Bases Processed:        125m    60.98m bases/sec

How can there be such a big difference between them, eg 15.9%, 31.7% and 44.4% of reads filtered?

Is cutadapt saying " Trimming 0 adapters with at most 10.0% errors in single-end mode " a bug? Cause I checked the fastq files and it is properly removing bp with a score <20.

Below are links to fastQC results

no trimming

cutadapt

sickle

bbduk

cutadapt sickle bbduk trimming • 3.4k views
ADD COMMENTlink modified 19 months ago • written 19 months ago by rioualen310
1

I think you're comparing apples with oranges: cutadapt is supposed to remove adapter sequences from reads, while sickle does quality trimming (scythe does adapter clipping).

ADD REPLYlink written 19 months ago by cschu1811.4k

Cutadapt does both 5'/3' trimming and adapters removal. Here I just used the options for trimming and specified no adapter. Plus, my sample has very few adapter sequences so it wouldn't make such a difference anyway...

ADD REPLYlink written 19 months ago by rioualen310

my sample has very few adapter sequences so it wouldn't make such a difference anyway..

Have you looked at each read to confirm :) On a serious note, I recommend that you try a third option. bbduk.sh from BBMap suite. Easy to use. A file with all commercial adapters included in software source. Capable of even removing single basepair contaminants at ends of reads.

ADD REPLYlink modified 19 months ago • written 19 months ago by genomax58k

I tried cutadapt with the adapters removal + same params and it gives a rate of 16.6% of discarded reads. My point is not to analyze this specific sample, but rather understand what's going on here, and if cutadapt is doing what I think it should be doing...

ADD REPLYlink written 19 months ago by rioualen310

Every program has its own logic and trying to compare the results from two may only be possible if you "looked under the hood" to absolutely make sure that parameters that you think are identical actually are. e.g. perhaps the two programs use different length "seeds" to look for initial adapter matches. Small differences like that can change the final result.

I am familiar with bbduk which uses k-mer matching (and you can specify the length to use). I suggest that you try bbduk.sh (no install is needed once you download, only Java) and see how the results compare to two you have already tried.

ADD REPLYlink modified 19 months ago • written 19 months ago by genomax58k

I added BBDuk output cause it's yet again very different, using again minimuum length of 20bp and Q20 for the quality threshold.

ADD REPLYlink written 19 months ago by rioualen310

I am going to tag Brian Bushnell (author of BBMap suite) so he can provide some additional insight: Brian Bushnell

But from bbduk.sh help:

trimq=20             Regions with average quality BELOW this will be trimmed.
minavgquality=0     (maq) Reads with average quality (after trimming) below 
                    this will be discarded.

In first case as soon as average quality in a region drops below that value (you seem to indicate you have Q scores that go down, up and down across the reads, which is odd and seems to indicate that there may have been some issue with this run e.g. air bubble going through the lane) the rest of the read would be trimmed.
With second option the entire read would be considered to have a certain average quality. Can you try the second option?

Also qtrim=w will give you windowed quality trimming so that would be more similar to sickle.

ADD REPLYlink modified 19 months ago • written 19 months ago by genomax58k

The quality doesn't seem to be going up and down in general, but I thought it might be the case for a fraction of reads, cause I can't think of another explanation so far... (see screencaps attached)

As for the minavgquality parameter, when I set it to 20 I have even more reads discarded (the quality of the samples is not terrific!)

ADD REPLYlink modified 19 months ago • written 19 months ago by rioualen310
1

I edited your original post so the links for the screen caps are now visible. Did you set minavgquality=20 instead of trimq=20 since combining those would lead to a different result.

Do you have a screencap of "before" any trimming? Can you add that for reference?

Are you using the adapters.fa file in the resources directory with your bbduk.sh command?

ADD REPLYlink modified 19 months ago • written 19 months ago by genomax58k

Thanks, I added a screencap before any trimming (like I said, the data is not great but it doesn't matter).

I tried minavgquality=20 instead of trimq=20 but that gets me ~70% of reads trimmed!

I'm doing more tests tomorrow.

ADD REPLYlink written 19 months ago by rioualen310
1

"minavgquality=20" does not trim, it simply discards reads with an average quality below 20. I recommend trimming instead. And as Devon points out, 20 is too high for most purposes.

I don't remember what trimming method is used by sickly or cutadapt, but BBDuk uses the phred algorithm, which leaves the longest area with average quality above the limit you specify such that it cannot be extended without adding an area of average quality below the limit you specify.

ADD REPLYlink written 19 months ago by Brian Bushnell16k

phred algorithm [...] leaves the longest area with average quality above the limit you specify such that it cannot be extended without adding an area of average quality below the limit you specify

dear Brian, thanks for your outstanding program. Please allow me to note, that the description of the trimq parameter is a bit obscure on your BBDuk guide page. If you copied the sentence above (or post 17 in your introductory seqanswers thread) to your documentation, you could maybe save a few forum questions or google efforts

ADD REPLYlink written 5 months ago by Carambakaracho570

Blargh, I stand corrected.

ADD REPLYlink written 19 months ago by cschu1811.4k
1

As an aside, trimming at Q20 is over the top for most applications (ChIPseq, RNAseq, BSseq, variant calling). Trimming around Q5 tends to be more reasonable.

ADD REPLYlink written 19 months ago by Devon Ryan86k

OK based on Cutadapt documentation I think I get the idea: http://cutadapt.readthedocs.io/en/stable/guide.html#quality-trimming-algorithm

Basically the "problem" is that it's trimming by starting at the end of the read. In my case, it seems many reads have a quality that is dropping, then going up, then dropping again. Cutadapt seems to cut out only the latest part, while Sickle uses a sliding window that cuts sequences from the first dropping point.

It's interesting to see there can be such a huge difference when using two seemingly similar programs. As a bioinformatician it gives yet one more reason to not rely on one single program for one specific task...

ADD REPLYlink written 19 months ago by rioualen310

For choosing proper tools to remove adapters from your reads check this article (Table 1).

ADD REPLYlink modified 19 months ago • written 19 months ago by EagleEye5.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 655 users visited in the last hour