Why are there two peaks in the per sequence GC content plot from a FastQC report?
12 weeks ago
Hi folks,

I know many people have asked similar questions, but I could not find a good interpretation for my data, so I am posting a new question here. I ran FastQC on whole-genome sequencing data (Illumina) from a rodent and found that the per sequence GC content does not follow a normal distribution. As you can see from the following figure, there are two peaks in the plot. I am not sure whether it is a coincidence, but the GC content of the 2nd peak (76%) is exactly twice that of the 1st peak (38%).

The first explanation that came to mind was contamination, which BLAST might help to identify. To test this idea, I extracted the sequences with 76% GC content from the raw Illumina data, but no contamination was detected. I then used RepeatMasker to look for repeats in those reads, but the amount of repeats was not sufficient to form a 2nd peak, so repeats are not likely to be the reason either.
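For anyone who wants to reproduce the extraction step, here is a minimal sketch in plain Python. It is not necessarily the exact script used for the results above. The function names, the file names in the usage note, and the 74-78% window around the 76% peak are all assumptions chosen for illustration.

```python
import gzip
from itertools import islice


def gc_percent(seq):
    """Return the GC content of a sequence as a percentage, ignoring N bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        header, seq, _plus, qual = (line.rstrip("\n") for line in record)
        yield header, seq, qual


def extract_by_gc(in_path, out_path, low=74.0, high=78.0):
    """Write reads whose GC% lies in [low, high] to out_path; return the count.

    The default window (74-78%) is an assumed range around the 76% peak.
    """
    opener = gzip.open if in_path.endswith(".gz") else open
    kept = 0
    with opener(in_path, "rt") as fin, open(out_path, "w") as fout:
        for header, seq, qual in read_fastq(fin):
            if low <= gc_percent(seq) <= high:
                fout.write(f"{header}\n{seq}\n+\n{qual}\n")
                kept += 1
    return kept
```

With hypothetical file names, a call would look like `extract_by_gc("reads_R1.fastq.gz", "gc76_reads.fastq")`. The resulting FASTQ can then be converted to FASTA and used as the query for BLAST or RepeatMasker.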

I also aligned the sequences with 76% GC content to our assembly (assembled from PacBio reads), and almost all of them map to specific regions. However, the GC content distribution of the assembly is a normal distribution with a single peak, as is the per sequence GC content of the PacBio reads.
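One way to make the assembly comparison concrete is to cut the assembly into read-sized windows and tabulate GC content per window, which gives a distribution that can be compared with the per-read plot. The sketch below is only an illustration under assumptions. The 150 bp window is a guess at the Illumina read length (not stated above), and the function names are hypothetical.

```python
from collections import Counter


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def window_gc_histogram(sequences, window=150):
    """Count non-overlapping windows per rounded GC%.

    window=150 is an assumed read length; trailing partial windows are skipped,
    and windows made up entirely of N are ignored.
    """
    hist = Counter()
    for seq in sequences:
        seq = seq.upper()
        for start in range(0, len(seq) - window + 1, window):
            chunk = seq[start:start + window]
            gc = chunk.count("G") + chunk.count("C")
            acgt = gc + chunk.count("A") + chunk.count("T")
            if acgt:
                hist[round(100 * gc / acgt)] += 1
    return hist
```

If the assembly contained many windows near 76% GC, they would show up as a second mode in this histogram. A single mode here, alongside two modes in the reads, would suggest that the high-GC reads are over-represented relative to the regions they map to.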

I hope the following information is also helpful. We sequenced a Hi-C library for this animal as well, and surprisingly (or maybe not), its per sequence GC content plot also has two peaks! So I wonder whether this problem is related to the sequencing technique, and whether it is a coincidence that the GC content of one peak is double that of the other.

