Good morning,
I simulated reads from the reference genome using wgsim (distributed with samtools):
wgsim -N 30000000 -1 151 -2 151 -r 0 -R 0 -X 0 -e 0 genome.fasta Sample_R1.fastq Sample_R2.fastq
and obtained FASTQ files with content like this:
@DQ898156.1_36602_37076_0:0:0_0:0:0_0/1
CTGTAGTCTGGCACTGCAAAAACAGGATACAGGTGTATATATGATATATATATATGTGTGGACATGTTGTGTATAAAGAACGAAAAAATGCGGATATGGTCGAATGGTAAAATTTCTCTTTGCCAAGGAGAAGATGCGGGTTCGATTCCCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@DQ898156.1_147753_148277_0:0:0_0:0:0_1/1
GGGATCCTCGCGGACAGAAAAAGATTGCAGTCAGTTTGATAATGATCGAGTGACATTGCTTCTTCGGCCCGAACCAAGGAATCCCTTAGATATGATGCAAAACGGATCTTGTTCTATCCTTGATCAGAGATTTCTCTATGAAAAAAACGAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Then I ran FastQC. Surprisingly, the "Per base sequence quality" plots look bad.
At the same time, the quality character "I" corresponds to a high Phred quality score!
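To make the point about "I" concrete, here is a minimal sketch of how a quality character maps to a Phred score. It assumes the two common ASCII offsets, 33 (Sanger / Illumina 1.8+) and 64 (older Illumina); the function names are only illustrative. Whether an offset of 64 was actually applied to this file is an assumption, not something shown in the report.

```python
def phred_score(char: str, offset: int) -> int:
    """Decode one FASTQ quality character using the given ASCII offset."""
    return ord(char) - offset


def error_probability(q: int) -> float:
    """Convert a Phred score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


# The same character decodes very differently depending on the assumed offset.
for name, offset in (("Phred+33", 33), ("Phred+64", 64)):
    q = phred_score("I", offset)
    print(f"{name}: Q{q}, error probability {error_probability(q):.4g}")
```

With offset 33 the character decodes to Q40, with offset 64 to Q9. So a file that contains only "I" looks excellent under one encoding and poor under the other.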
The report also shows 99.1% Dups. But simply scrolling through the FASTQ file suggests that this is not the case.
Could you please explain the reason for such an unexpected FastQC result? (Maybe the FASTQ quality encoding was detected incorrectly.) Will other programs, such as bwa-mem2, work correctly with my simulated data?
Best regards, Poecile
Thank you so much for such a quick response!
Everything is clear with plots now.
As for the duplication, it's a pity; I was hoping it was a FastQC error :) Perhaps it is because I used a plastome as the reference, with its inverted repeats IRA and IRB... But they do not occupy such a large percentage of the sequence.
I now understand why I got such a high percentage of duplication. The reference is only about 150 kbp long, while I asked for 30M reads of 151 bp each. Of course, they overlap heavily.
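A rough back-of-the-envelope check of this reasoning, as a sketch under stated assumptions: the reference is treated as a single linear sequence of exactly 150,000 bp, the reads are error-free (as requested with -e 0 and -r 0), and -N is taken as the number of read pairs. The variable names are only illustrative. FastQC computes its own duplication estimate differently, so this gives only a lower bound for one of the two files.

```python
genome_length = 150_000   # approximate reference size (assumed exact here)
read_length = 151         # from -1 151 -2 151
n_pairs = 30_000_000      # from -N 30000000

# Average sequencing depth over both mates.
total_bases = 2 * n_pairs * read_length
coverage = total_bases / genome_length

# An error-free read is fully determined by its start position and strand,
# so one FASTQ file can hold at most this many distinct sequences.
start_positions = genome_length - read_length + 1
max_distinct = 2 * start_positions

# Every read beyond the distinct ones must be a copy of an earlier read.
min_dup_fraction = 1 - max_distinct / n_pairs

print(f"coverage: about {coverage:,.0f}x")
print(f"at most {max_distinct:,} distinct reads per file")
print(f"at least {min_dup_fraction:.1%} duplicates per file")
```

Under these assumptions at least about 99% of the reads in each file are exact copies of another read, which is in line with the reported 99.1%.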