MultiQC output shows that data processing steps are ineffective
2.1 years ago

I ran MultiQC to check the quality of reads before and after data processing, but the reports do not show significant improvement. May I know why?
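For context, the before/after comparison was produced along these lines (directory names here are assumptions, not the exact paths used); shown as a dry run that only prints the FastQC/MultiQC commands:

```shell
# Dry-run sketch of the QC comparison; drop the `echo`s to actually run the tools.
# raw_data/ and corrected_data/ are placeholder directory names.
echo "fastqc -t 8 -o qc_raw raw_data/*.fastq"
echo "fastqc -t 8 -o qc_clean corrected_data/*.fq.gz"
report="multiqc -n before_vs_after qc_raw qc_clean"
echo "$report"
```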

Pre-processing status check

First I trimmed the adapters using BBDuk.

for f in `ls -1 *_1.fastq | sed 's/_1.fastq//'`;
do bbduk.sh -Xmx20g in1=$f\_1.fastq in2=$f\_2.fastq out1=../clean_data/$f\_1.fq out2=../clean_data/$f\_2.fq ref=../adapters.fa ktrim=r k=25 mink=10 ftm=5 tbo tpe;
done

Second, I performed quality trimming:

for f in `ls -1 *_1.fq.gz | sed 's/_1.fq.gz//'`;
do bbduk.sh -Xmx20g in1=$f\_1.fq.gz in2=$f\_2.fq.gz out1=../trimmed_data/$f\_1.fq out2=../trimmed_data/$f\_2.fq qtrim=r trimq=10 maq=10;
done

Third, I performed error correction using Musket:

for f in `ls -1 *.fq.gz | sed 's/.fq.gz//'`;
do ./../../musket-1.1/musket -k 21 536879812 -p 20 -zlib 9 -o ../corrected_data/$f\.fq.gz $f\.fq.gz;
done
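As an aside, the `ls | sed` loops above can be written more robustly with a glob plus bash parameter expansion. A minimal sketch with hypothetical sample names, printing the command instead of invoking bbduk.sh:

```shell
# Sketch: derive paired-end prefixes with parameter expansion instead of `ls | sed`.
# Sample names are hypothetical; in practice use:  for f1 in *_1.fastq
# Swap `echo` for the real bbduk.sh call to execute.
for f1 in SRR001_1.fastq SRR002_1.fastq; do
  sample=${f1%_1.fastq}   # strip the _1.fastq suffix
  echo "bbduk.sh in1=${sample}_1.fastq in2=${sample}_2.fastq out1=../clean_data/${sample}_1.fq out2=../clean_data/${sample}_2.fq"
done
```

This avoids parsing `ls` output, which breaks on unusual file names.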

Post-processing status check

ATpoint • 2.1 years ago

The adapters are gone, and that is the only really relevant metric in FastQC, unless you were also seeing bad per-base quality indicating a sequencing failure.

Mensur Dlakic • 2.1 years ago

There is clear improvement: more green and gold, less red. There are no adapters because of your first step, and per-base sequence quality is better because of your second step. Error correction may or may not change the sequence of reads, but it will not change the base qualities, so I would not expect FastQC to detect anything from the third step in qualitative terms.
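The point about error correction can be illustrated with a toy comparison: FastQC's quality plots are derived from the quality lines, which Musket leaves untouched, so only a sequence-level diff would reveal the third step. A sketch with made-up stand-in sequences (file names are throwaway):

```shell
# Toy illustration: count "reads" whose sequence differs between original
# and corrected sets. These 4-base sequences are invented for the demo.
printf 'ACGT\nACGA\nGGGG\n' > orig_seqs.txt   # original read sequences
printf 'ACGT\nACGT\nGGGG\n' > corr_seqs.txt   # read 2 had one base corrected
changed=$(paste orig_seqs.txt corr_seqs.txt | awk '$1 != $2 {n++} END {print n+0}')
echo "reads changed by correction: $changed"
```

A nonzero count shows the corrector did work, even though the FastQC report looks identical.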

2.1 years ago

Since you are already using BBDuk from the BBTools suite, you might also want to run BBNorm. Like Musket, it can error-correct reads, but it can also filter reads based on k-mer content. Since you have a lot of duplication and over-represented sequences, you may want to discard those reads (of course, only if it is not a quantitative experiment like RNA-seq). A default clumpify.sh deduplication step should, however, also give quite a good result.
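A BBNorm invocation along these lines could be a starting point; the flags follow my reading of the BBNorm guide (`target`/`min` for normalization depth, `ecc=t` for error correction) and the file names are placeholders. Shown as a dry run that only prints the command:

```shell
# Dry-run sketch of BBNorm (placeholders throughout; remove `echo "$cmd"` / run
# "$cmd" directly once paths are real). target=100 caps coverage at 100x and
# min=5 discards reads with k-mer depth below 5; ecc=t also error-corrects.
cmd="bbnorm.sh in=reads.fq.gz out=normalized.fq.gz target=100 min=5 ecc=t"
echo "$cmd"
```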

Apart from this, a typical preprocessing with BBTools may look like this:

#Sequence-based deduplication (optical is only possible if read headers are intact which is often not the case with SRA)
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical

#Remove low-quality regions
#This step requires standard Illumina read headers and will not work with renamed reads, such as most SRA data.
filterbytile.sh in=clumped.fq.gz out=filtered_by_tile.fq.gz

#Trim adapters
bbduk.sh in=filtered_by_tile.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=100 ref=bbmap/resources/adapters.fa ftm=5 ordered

#Remove synthetic artifacts and spike-ins.  Add "qtrim=r trimq=8" to also perform quality-trimming at this point, but not if quality recalibration will be done later.
bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=27 ref=bbmap/resources/sequencing_artifacts.fa.gz,bbmap/resources/phix174_ill.ref.fa.gz ordered 
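The four steps above could be chained per sample in a small wrapper. A dry-run sketch with a hypothetical sample name (the `run` stub prints each command; replace it with plain execution once paths and reference files are in place):

```shell
# Dry-run wrapper for the BBTools chain above. SRR001 is a hypothetical sample;
# swap the `run` stub for direct execution to actually process reads.
run() { echo "+ $*"; n=$((n+1)); }
n=0
for s in SRR001; do
  run clumpify.sh in=${s}.fq.gz out=${s}_clumped.fq.gz dedupe optical
  run filterbytile.sh in=${s}_clumped.fq.gz out=${s}_fbt.fq.gz
  run bbduk.sh in=${s}_fbt.fq.gz out=${s}_trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=100 ref=bbmap/resources/adapters.fa ftm=5 ordered
  run bbduk.sh in=${s}_trimmed.fq.gz out=${s}_filtered.fq.gz k=27 ref=bbmap/resources/sequencing_artifacts.fa.gz,bbmap/resources/phix174_ill.ref.fa.gz ordered
done
echo "commands printed: $n"
```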