Why are there two peaks in the per sequence GC content plot of a FastQC report?
2.8 years ago
Meng • 0

Hi folks,

I know many people have asked similar questions, but I could not find a good interpretation for my data, so I am posting a new question here. I ran FastQC on whole genome sequencing (Illumina) data from a rodent and found that the per sequence GC content does not follow a normal distribution. As you can see in the following figure, there are two peaks in the plot. I am not sure whether it is a coincidence, but the GC content of the 2nd peak (76%) is exactly twice that of the 1st peak (38%).

[Figure: FastQC per sequence GC content plot showing two peaks, at 38% and 76% GC]

The first explanation that came to mind was contamination, which BLAST might help to identify. To test this idea, I extracted the reads with 76% GC content from the raw Illumina data, but no contamination was detected. I then used RepeatMasker to look for repeats in those reads, but the amount found was not sufficient to form a 2nd peak, so repeats are not a likely cause either.
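For anyone who wants to reproduce the extraction step, here is a minimal sketch of how reads near a given GC content could be pulled out of a FASTQ file. It is not necessarily the exact method used above. The function names (`gc_percent`, `read_fastq`, `filter_by_gc`) and the GC window are illustrative, and ignoring `N` bases when computing GC is an assumption that may differ from how FastQC counts them.

```python
import io


def gc_percent(seq):
    """Return GC content as a percentage of the A/C/G/T bases in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # assumption: N and other ambiguity codes are ignored
    return 100.0 * gc / total if total else 0.0


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def filter_by_gc(records, low=75.0, high=77.0):
    """Yield only the records whose GC content lies within [low, high] percent."""
    for header, seq, qual in records:
        if low <= gc_percent(seq) <= high:
            yield header, seq, qual


# Tiny in-memory example; for real data, pass an open FASTQ file handle instead.
example = io.StringIO(
    "@read1\nGCGCGCGCAT\n+\nIIIIIIIIII\n"
    "@read2\nATATATATGC\n+\nIIIIIIIIII\n"
)
kept = list(filter_by_gc(read_fastq(example), low=70.0, high=90.0))
print([header for header, _, _ in kept])
```

The kept reads can then be written out as FASTA and passed to BLAST or RepeatMasker.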

I also aligned the sequences with 76% GC content to our assembly (assembled from PacBio reads), and almost all of them map to certain regions. However, the GC content distribution of the assembly is a normal distribution with only one peak, and so is the per sequence GC content of the PacBio reads.
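To compare the assembly with the Illumina plot on the same footing, GC content can be measured in windows of roughly read length rather than over whole contigs. The sketch below assumes 150 bp windows (the Illumina read length is not stated here, so 150 is a placeholder) and non-overlapping windows. `windowed_gc` and `gc_histogram` are hypothetical helper names.

```python
def gc_percent(seq):
    """Return GC content as a percentage of the A/C/G/T bases in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # assumption: N bases are ignored
    return 100.0 * gc / total if total else 0.0


def windowed_gc(sequence, window=150):
    """Yield GC% for consecutive non-overlapping windows of the given size."""
    for start in range(0, len(sequence) - window + 1, window):
        yield gc_percent(sequence[start:start + window])


def gc_histogram(values):
    """Count values into 101 bins, one per integer GC percentage (0..100)."""
    counts = [0] * 101
    for value in values:
        counts[int(round(value))] += 1
    return counts


# Toy contig: an AT-only half followed by a GC-only half.
contig = "AT" * 150 + "GC" * 150
hist = gc_histogram(windowed_gc(contig, window=150))
print(hist[0], hist[100])
```

Plotting this histogram next to the FastQC curve would show whether the assembly also contains a population of read-sized windows near 76% GC.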

I hope one more piece of information is helpful. We also generated Hi-C data for this animal, and surprisingly (or maybe not), its per sequence GC content plot has two peaks as well! So I wonder whether this problem is related to the sequencing technology, and whether it is a coincidence that the GC content of one peak is double that of the other.

distribution WGS sequencing GC Illumina • 1.5k views