Question

How to process data from Illumina sequencing?

4.3 years ago

vicslab • 0

Hi!

I have data from a bacterial genome sequenced by Illumina technology. I have to check QC and remove adapters and poor quality sequences. I saw trimmomatic is a good option, but I have some questions:

How do I choose the average quality threshold? What should I consider when picking a value for a parameter?

What do LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 mean?

genome trimmomatic illumina sequencing data • 1.1k views

updated 4.3 years ago by Michael 54k • written 4.3 years ago by vicslab • 0

Have you looked at the Trimmomatic manual for an explanation of those parameters? You may also want to take a look at bbduk.sh from the BBMap suite. There is a guide available here. The options are easy to understand, though there are a lot of them.

4.3 years ago by GenoMax 141k

Yes, I am reading it. There is an explanation like:

SLIDINGWINDOW:<windowSize>:<requiredQuality>
windowSize: specifies the number of bases to average across
requiredQuality: specifies the average quality required

SLIDINGWINDOW: Performs a sliding window trimming approach. It starts scanning at the 5' end and clips the read once the average quality within the window falls below a threshold.

But I am not sure about the threshold I should use.

4.3 years ago by vicslab • 0

score 2 · Answer 1 · 2020-01-14

Do the QC first to check whether you really need quality trimming and adapter removal. Use e.g. FastQC and MultiQC for this. If there is no considerable adapter contamination and all reads have good to very good quality, you might not need to trim at all. I would say trimming is mostly not required anymore. For assembly you might do it anyway, just to be conservative. Trimmomatic will give you the number or proportion of reads trimmed.

For more details on the trimming parameters, please read the fine manual: http://www.usadellab.org/cms/index.php?page=trimmomatic

In short, assuming you have paired-end reads and want to assemble a genome, try something like the command recommended there:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36

Ask your sequencing center which TruSeq protocol version was used and whether any non-standard adapters were involved.
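If you run this from a script, it can help to assemble the command as an argument list, so the two inputs, the four outputs and the trimming steps stay visibly separate. The sketch below only builds the list and does not run anything. The function name, the `out_prefix` parameter and the defaults are assumptions for illustration, taken from the file names in the command above.

```python
import shlex


def build_trimmomatic_pe_command(
    forward="input_forward.fq.gz",
    reverse="input_reverse.fq.gz",
    out_prefix="output",         # hypothetical prefix for the four output files
    jar="trimmomatic-0.35.jar",
    adapters="TruSeq3-PE.fa",    # replace with the adapter file matching your library prep
    steps=("SLIDINGWINDOW:4:15", "MINLEN:36"),
):
    """Return the paired-end Trimmomatic call as a list of arguments."""
    # Paired-end mode writes a paired and an unpaired file per read direction.
    outputs = [
        f"{out_prefix}_forward_paired.fq.gz",
        f"{out_prefix}_forward_unpaired.fq.gz",
        f"{out_prefix}_reverse_paired.fq.gz",
        f"{out_prefix}_reverse_unpaired.fq.gz",
    ]
    return [
        "java", "-jar", jar, "PE", "-phred33",
        forward, reverse, *outputs,
        f"ILLUMINACLIP:{adapters}:2:30:10",
        *steps,  # steps are applied in the order given
    ]


command = build_trimmomatic_pe_command()
print(shlex.join(command))
# To actually run it (requires Java and the jar file):
#   import subprocess; subprocess.run(command, check=True)
```

Passing `steps=("LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36")` would add the end-trimming steps if you decide you want them.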

LEADING:3 TRAILING:3 trim bases from the start and the end of a read as long as their quality is below the threshold (3 here). This is likely not needed.
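To make the end trimming concrete, here is a minimal Python sketch of the idea, operating on a read and a list of its per-base quality scores. It illustrates the behaviour described in the manual, not Trimmomatic's actual code, and the function names are made up for this example.

```python
def count_leading_low(quals, threshold=3):
    """Number of bases at the start of the read with quality below threshold."""
    start = 0
    while start < len(quals) and quals[start] < threshold:
        start += 1
    return start


def count_kept_after_trailing(quals, threshold=3):
    """Read length left after cutting low-quality bases off the end."""
    end = len(quals)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return end


def trim_ends(seq, quals, threshold=3):
    """Apply both trims; returns an empty read if every base is below threshold."""
    end = count_kept_after_trailing(quals, threshold)
    start = min(count_leading_low(quals, threshold), end)
    return seq[start:end], quals[start:end]


# Two poor bases at the start and one at the end are removed.
print(trim_ends("NNACGTN", [2, 2, 30, 30, 30, 30, 2]))
```

With a threshold as low as 3, only the very worst calls are removed, which is why this step rarely changes much on modern data.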

SLIDINGWINDOW:4:15 requires a minimum average quality of 15 within a window of 4 bases; the read is cut once the window average drops below that.
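A simplified Python sketch of the sliding-window idea may help when picking the two numbers. It is an assumption-laden illustration: it cuts at the start of the first failing window, whereas the exact cut position in Trimmomatic itself may differ, and the function name is invented here.

```python
def sliding_window_keep_length(quals, window_size=4, required_quality=15):
    """Return how many bases to keep from the 5' end.

    Simplified rule: scan windows from the 5' end and cut at the start of the
    first window whose average quality is below required_quality.
    Reads shorter than one window are returned unchanged in this sketch.
    """
    for start in range(len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        # Compare sums instead of averages to avoid floating point.
        if sum(window) < required_quality * window_size:
            return start
    return len(quals)


quals = [30, 30, 30, 30, 30, 10, 10, 10, 10]
keep = sliding_window_keep_length(quals)
print(keep, quals[:keep])
```

A larger window tolerates isolated bad bases, while a higher required quality cuts earlier. For example, calling the function with `required_quality=30` on the same read cuts after the second base in this sketch, because the third window already contains a base of quality 10.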

I guess you could also set the threshold to 30 and still keep 90% of your reads. The threshold is on the same scale as the y-axis of the FastQC per-base sequence quality plot. I think 28 marks the lower limit of the green zone in FastQC. Personally, I am not totally convinced that Illumina quality scores bear much relation to reality, though.
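For orientation on what these numbers mean: the thresholds are Phred scores, defined as Q = -10 * log10(p), where p is the estimated probability that the base call is wrong. With `-phred33`, each quality character in the FASTQ file is the score plus 33 as an ASCII code. A small Python sketch (the function names are only for this example):

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string (ASCII offset 33) to integer Phred scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Estimated probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)


print(decode_phred33("I5#"))
for q in (15, 20, 30):
    print(q, f"{error_probability(q):.4f}")
```

So a threshold of 15 accepts roughly a 3% estimated error rate per base, 20 corresponds to 1%, and 30 to 0.1%.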

From the manual page:

Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
Remove leading low quality or N bases (below quality 3) (LEADING:3)
Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
Drop reads shorter than 36 bases (MINLEN:36)