wrong quality plots in fastqc output
13 months ago
poecile.pal ▴ 50

Good morning,

I simulated reads from a reference genome using samtools wgsim:

wgsim -N 30000000 -1 151 -2 151 -r 0 -R 0 -X 0 -e 0 genome.fasta Sample_R1.fastq Sample_R2.fastq

and obtained fastq files with such content:

@DQ898156.1_36602_37076_0:0:0_0:0:0_0/1
CTGTAGTCTGGCACTGCAAAAACAGGATACAGGTGTATATATGATATATATATATGTGTGGACATGTTGTGTATAAAGAACGAAAAAATGCGGATATGGTCGAATGGTAAAATTTCTCTTTGCCAAGGAGAAGATGCGGGTTCGATTCCCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@DQ898156.1_147753_148277_0:0:0_0:0:0_1/1
GGGATCCTCGCGGACAGAAAAAGATTGCAGTCAGTTTGATAATGATCGAGTGACATTGCTTCTTCGGCCCGAACCAAGGAATCCCTTAGATATGATGCAAAACGGATCTTGTTCTATCCTTGATCAGAGATTTCTCTATGAAAAAAACGAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Then I launched fastqc. Surprisingly, Per base sequence quality plots are bad:

[Image: FastQC per base sequence quality plot]

At the same time, the quality character 'I' corresponds to a high Phred quality score (Q40 in Phred+33 encoding)!

I also see 99.1% duplicates in the report, but simply scrolling through the fastq file shows me that this is not the case.

Could you please explain the reason for such an unexpected FastQC result? (Maybe the FASTQ encoding was recognized incorrectly.) Will other programs, such as bwa-mem2, work correctly with my simulated data?

Best regards, Poecile

samtools fastqc fastq wgsim • 599 views
13 months ago
GenoMax 141k

Per base sequence quality plots are bad

How so? Your Phred scores are so high that they do not even show up on your FastQC plot, since the Y-axis only goes up to Q34.
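To check what a quality character means, it can be decoded by hand. Below is a minimal sketch that assumes the file uses Phred+33 (Sanger / Illumina 1.8+) encoding; the function names are made up for illustration and are not part of FastQC or wgsim.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    # Each character's ASCII code minus the offset 33 gives the Phred score.
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Convert a Phred score Q to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


# 'I' is ASCII 73, so 73 - 33 = 40
scores = phred33_scores("IIII")
print(scores)
print(error_probability(scores[0]))
```

So every base in the simulated reads is Q40, i.e. an error probability of about 1 in 10,000.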

But simply scrolling through the fastq file shows me that this is not the case.

FastQC only tracks sequences that first appear in the first 100K reads when it estimates duplication. It also truncates reads longer than 75 bp to 50 bp to keep the memory requirement under control. I don't know how wgsim simulates the reads, but if those reads are represented multiple times later in the file, you will see the result you have.
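The sampling described above can be sketched roughly as follows. This is a simplified approximation for illustration, not FastQC's actual code: the function name and parameters are hypothetical, and the real module applies further corrections to its estimate that are left out here.

```python
from collections import Counter


def estimate_duplication(sequences, track_limit=100_000, long_read=75, truncate_to=50):
    """Rough percentage of duplicated reads among the tracked sequences.

    Assumptions (simplified): new sequences are only registered while we are
    within the first `track_limit` reads, but already registered sequences keep
    being counted to the end of the input. Reads longer than `long_read` are
    truncated to `truncate_to` bases before comparison.
    """
    counts = Counter()
    for index, seq in enumerate(sequences):
        key = seq[:truncate_to] if len(seq) > long_read else seq
        if key in counts:
            counts[key] += 1      # seen before: count it wherever it occurs
        elif index < track_limit:
            counts[key] = 1       # new sequence inside the tracking window
        # new sequences after the window are ignored
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (total - len(counts)) / total


reads = ["ACGT", "ACGT", "TTTT", "ACGT"]
print(estimate_duplication(reads))
```

Note that truncation makes two 151 bp reads count as duplicates whenever their first 50 bases match, even if they differ further along.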

Thank you so much for such a quick response!

Everything is clear with plots now.

As for duplication, it's a pity, I was hoping that it was a FastQC error :) Perhaps this is because I used a plastome as the reference, with the inverted repeats IRA and IRB... But they do not occupy such a large % of the sequence.

I understand now why I got such a high percentage of duplication. The reference is only about 150 kbp, while I asked for 30M read pairs (wgsim -N) of length 151 bp. Of course, they overlap heavily.
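A quick back-of-the-envelope check of the numbers, as a sketch. It assumes the reference is exactly 150,000 bp (the real size is only approximately that), and the function name is just for illustration.

```python
def expected_coverage(read_pairs, read_length, genome_size):
    """Average sequencing depth for paired-end reads of equal length."""
    return read_pairs * 2 * read_length / genome_size


read_pairs = 30_000_000   # wgsim -N counts read pairs
read_length = 151         # wgsim -1 / -2
genome_size = 150_000     # assumed approximate plastome size

print(expected_coverage(read_pairs, read_length, genome_size))  # 60400.0
print(read_pairs / genome_size)  # 200.0 pairs per reference position on average
```

With a depth of roughly 60,000x and about 200 read pairs for every position in the reference, many reads are bound to start at the same place, so a duplication level close to 100% is expected.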
