I'm analysing some human CLL data (cancer, whole exome). When I run FastQC to check the data quality, all samples show a bimodal GC content distribution. Generally the only warning FastQC raises is for the GC content module; the other modules are normally fine.
I have run fastq_screen against the human genome only: about 80% of reads have a single hit, 18% have multiple hits, and about 0.6% do not map to human. This makes me think that there is no contamination in the samples.
Even after some thought, I do not know why the samples show this kind of distribution.
Good news everyone!
To be honest, we get these strange bimodal GC distributions in every run. I just finished inspecting one human sample: I decided to intersect the reads in my BAM file with exonic and intronic regions downloaded from UCSC, and it fits perfectly.
This is how it looks in FastQC:
And this is the same GC plot colored by genomic location. You can see there are two main peaks, for introns and exons respectively:
So, here is one more possible explanation of bimodal GC content, but it is library-specific. In our lab we use Agilent Focused Exome.
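A minimal sketch of the per-region GC breakdown described above, assuming each read has already been labelled as "exon" or "intron" (for example from an intersection of the BAM file with the UCSC regions). The function names and the toy reads are hypothetical, not taken from our actual pipeline:

```python
from collections import Counter, defaultdict


def gc_percent(seq):
    """GC percentage over called bases (A, C, G, T); N and others are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    called = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / called if called else 0.0


def gc_histograms(labelled_reads):
    """Return {label: Counter({rounded GC percent: read count})}."""
    hists = defaultdict(Counter)
    for seq, label in labelled_reads:
        hists[label][round(gc_percent(seq))] += 1
    return hists


# Toy input; real labels would come from the exon/intron intersection.
reads = [("GGCC", "exon"), ("GCAT", "exon"), ("ATAT", "intron")]
hists = gc_histograms(reads)
for label in sorted(hists):
    print(label, sorted(hists[label].items()))
```

Plotting one histogram per label on the same axes should show whether each FastQC peak is dominated by a single region class.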
Hope this helps!
I don't have particular experience with either human or exome sequencing, but I have come across similar distributions in genome sequencing projects. Among others, I observed it for a highly repetitive plant genome. In that case, the second peak corresponded to a specific repeat class that was highly abundant in the data set.
Given your mapping results, I concur: contamination is unlikely. So I would try to figure out which locations in the genome these high-GC reads derive from and whether you can associate them with some useful annotation. Based on your mappings, you could extract regions of the genome with proper read coverage, e.g. with bedtools, and then look for entire sequences or large windows of high GC.
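As a rough illustration of the "large windows of high GC" idea, here is a minimal sketch that scans one extracted sequence with a sliding window. The window size, step and 60% threshold are arbitrary assumptions, and the function names are made up for this example:

```python
def gc_fraction(seq):
    """GC fraction over called bases (A, C, G, T); 0.0 if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    called = gc + seq.count("A") + seq.count("T")
    return gc / called if called else 0.0


def high_gc_windows(seq, window=1000, step=500, threshold=0.6):
    """Yield (start, end, gc) for each window whose GC fraction >= threshold.

    Coordinates are 0-based, end-exclusive. A sequence shorter than
    `window` is evaluated once as a single, shorter window.
    """
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        gc = gc_fraction(chunk)
        if gc >= threshold:
            yield start, start + len(chunk), gc


# Toy sequence: 20 bp of AT followed by 20 bp of GC.
toy = "AT" * 10 + "GC" * 10
hits = list(high_gc_windows(toy, window=10, step=10, threshold=0.6))
print(hits)
```

The reported intervals could then be compared against repeat or gene annotations for the same coordinates.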
Hi! I recently stumbled upon this nice little example of a bimodal GC content distribution in WG-Seq data from orange. We suspected possible contamination. Upon BLASTing some of the reads with high %GC, I got hits that looked like "C.limon DNA for clsat_9 satellite" (satellite DNA). Looking at the citation ( https://link.springer.com/article/10.1007/s001220100719 ), I corroborated that Citrus species are rich in satellite DNA, which has a GC content between 60% and 68%. So that explained our secondary peak. Cool!
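For anyone wanting to repeat the "BLAST a few high-GC reads" step, a minimal sketch of pulling such reads out of a FASTQ file follows. It assumes plain four-line FASTQ records (no wrapped sequences); the 60% cutoff and the function names are illustrative choices, not what we actually ran:

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence) from a four-line-per-record FASTQ stream."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record[0].strip()[1:], record[1].strip()


def high_gc_reads(handle, min_gc=0.6):
    """Yield (name, sequence) for reads whose GC fraction is >= min_gc."""
    for name, seq in read_fastq(handle):
        s = seq.upper()
        called = sum(s.count(base) for base in "ACGT")
        if called and (s.count("G") + s.count("C")) / called >= min_gc:
            yield name, seq


# Toy FASTQ with one high-GC and one low-GC read.
toy = io.StringIO("@r1\nGGCCGG\n+\nIIIIII\n@r2\nATATGC\n+\nIIIIII\n")
selected = list(high_gc_reads(toy))
for name, seq in selected:
    print(f">{name}\n{seq}")  # FASTA output, ready to paste into BLAST
```

With a real file, the `io.StringIO` object would be replaced by an opened FASTQ file handle.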
I don't think you can necessarily extend the observations made above directly to RNA-Seq experiments. Also, I don't really know whether a bimodal GC distribution is something to be concerned about in the first place when looking at RNA-Seq data. You might need to talk to people more involved with RNA-Seq. Sorry.