Question: How to process data from Illumina sequencing?
10 months ago by
vicslab0 wrote:


I have data from a bacterial genome sequenced with Illumina technology. I have to run QC and remove adapters and poor-quality sequences. I saw that Trimmomatic is a good option, but I have some questions:

How should I choose the average quality threshold? What should I consider when picking a value for this parameter?



Have you looked at the Trimmomatic manual for an explanation of those parameters? You may also want to take a look at the BBMap suite, which has a guide available. Its options are easy to understand, though there are a lot of them.

Comment written 10 months ago by GenoMax

Yes, I am reading it. There is an explanation like:

SLIDINGWINDOW:<windowSize>:<requiredQuality>
windowSize: specifies the number of bases to average across
requiredQuality: specifies the average quality required

SLIDINGWINDOW: Performs a sliding window trimming approach. It starts scanning at the 5' end and clips the read once the average quality within the window falls below a threshold.

But I am not sure about the threshold I should use.

Comment written 10 months ago by vicslab
10 months ago by
Bergen, Norway
Michael Dondrup48k wrote:

Do the QC first to check whether you really need to quality-trim and remove adapters. Use e.g. FastQC and MultiQC for this job. If there is no considerable adapter contamination and all reads have good to very good quality, you might not need to do any trimming at all. I would say: trimming is mostly not required anymore. For the purpose of assembly you might do it anyway just to be conservative. Trimmomatic will report the number and proportion of reads trimmed.

For more details on the trimming parameters, please read the fine manual.

In short, I assume you have paired-end reads and want to assemble a genome, so try something like the recommendation here:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36

Ask your sequencing center for the TruSeq protocol version and any non-standard adapters they may have used.

LEADING:3 TRAILING:3 (part of the manual's example, left out of the command above) trim bases from the start and end of a read while their quality is below the threshold of 3. This is likely not needed.

SLIDINGWINDOW:4:15 sets a minimum average quality of 15 for a window of 4 bases.
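To make the windowing idea concrete, here is a minimal Python sketch of a sliding-window cut. It is a simplified illustration, not Trimmomatic's actual implementation (which differs in some edge-case details); the function names `decode_phred33` and `sliding_window_keep` are made up for this example.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string (Phred+33 encoding) to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_keep(quals, window_size=4, required_quality=15):
    """Return the number of bases to keep from the 5' end.

    Simplified rule: cut at the start of the first window whose mean
    quality is below required_quality. Not Trimmomatic's exact logic.
    """
    for start in range(len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < required_quality:
            return start
    # No failing window (or read shorter than the window): keep everything.
    return len(quals)


# Toy read: five bases at Q30 ('?') followed by four bases at Q10 ('+').
quals = decode_phred33("?????++++")
keep_at_15 = sliding_window_keep(quals, 4, 15)
keep_at_30 = sliding_window_keep(quals, 4, 30)
```

On this toy read the stricter threshold cuts earlier, because a single low-quality base is enough to pull a 4-base window below an average of 30.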

I guess that you could also set the threshold to 30 and still keep 90% of your reads, but check this against your own data. The threshold is a Phred quality score, the same value FastQC shows on the y-axis of its per-base sequence quality plot. I think 28 marks the lower limit of the green zone in FastQC. Personally, I am not totally convinced that Illumina quality scores reflect reality all that well, though.
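If it helps to decide on a number: a Phred score Q corresponds to an estimated per-base error probability p via Q = -10 * log10(p). A small Python sketch of the conversion (the function names are just for illustration):

```python
import math


def phred_to_error_prob(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


# Compare the thresholds discussed in this thread.
for q in (15, 20, 30):
    print(f"Q{q}: about 1 error in {round(1 / phred_to_error_prob(q))} bases")
```

So a window average of 15 tolerates roughly one expected error in 32 bases, while 30 corresponds to one in 1000.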

From the manual page:

Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
Remove leading low quality or N bases (below quality 3) (LEADING:3)
Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
Drop reads shorter than 36 bases (MINLEN:36)
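The LEADING, TRAILING and MINLEN steps in this list are simple enough to sketch in a few lines of Python. This is only an illustration of the logic as described in the manual text above, working on a list of per-base Phred scores; the function names are hypothetical and this is not Trimmomatic's code.

```python
def trim_leading(quals, threshold=3):
    """Drop bases from the start while their quality is below threshold."""
    i = 0
    while i < len(quals) and quals[i] < threshold:
        i += 1
    return quals[i:]


def trim_trailing(quals, threshold=3):
    """Drop bases from the end while their quality is below threshold."""
    j = len(quals)
    while j > 0 and quals[j - 1] < threshold:
        j -= 1
    return quals[:j]


def passes_minlen(quals, min_len=36):
    """Keep a read only if at least min_len bases remain after trimming."""
    return len(quals) >= min_len


# Toy read: two Q2 bases, 40 good bases, one Q2 base at the end.
read = [2, 2] + [35] * 40 + [2]
trimmed = trim_trailing(trim_leading(read))
kept = passes_minlen(trimmed)
```

Note that the steps are applied in the order given on the command line, so MINLEN should come last to act on the final read length.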