Question: Bimodal GC content
4.6 years ago by
Folder40g130 wrote:


I'm analysing some human CLL data (cancer, whole exome), and when running FastQC to check the data I see that all samples show a bimodal GC content. Generally the only warning FastQC gives is for the GC module; the other modules are normally fine.
I have run fastq_screen against the human genome only: about 80% of reads have a single hit, 18% have multiple hits and about 0.6% do not map to human, which makes me think there is no contamination in the samples.

After some thought, I still do not know why the samples show this kind of distribution.


Thanks for your time.


content gc • 5.0k views
ADD COMMENTlink modified 2.6 years ago by raulAlc10 • written 4.6 years ago by Folder40g130

Used to routinely see this with Agilent human exomes, never had a particularly good explanation for it other than there might be some inherent bias in the baits?

ADD REPLYlink written 4.6 years ago by Daniel Swan13k

Just to back up Dan's answer, I've seen the same in Agilent exomes, and haven't come up with a reasonable explanation. Traditionally, with things like RNA-seq, this bimodal distribution would make me go straight to the possibility of sample contamination, but with exomes it seems systematic more than anything else.

ADD REPLYlink modified 8 months ago by RamRS30k • written 4.6 years ago by andrew.j.skelton736.0k

These data were also obtained from a whole-exome seq library prepared with Agilent SureSelect.

ADD REPLYlink written 4.6 years ago by Folder40g130

Did you ever find a solution for your issue runnerbio?

I wrote a small tool to drill down into BAM statistics like GC% to see if your secondary peak is over-represented in certain reads (certain chromosomes, certain mapping conformations, certain read flags, certain fragment lengths, certain read tags, etc etc).

I haven't published it to GitHub yet, but if you would be interested in 'test driving' it to see if it can help you figure out your issue, I'd be more than willing to give some support as you go along :) Here's a video - skip to about min. 9:00 :)

ADD REPLYlink modified 8 months ago by RamRS30k • written 4.6 years ago by John12k

No, I haven't found a reason for this behavior. I don't think there is contamination from bacteria or fungi in these samples, nor do I think that sample heterogeneity can cause this (this is exome data; I could imagine RNA data and sample heterogeneity showing bimodal GC content). And finally, as two people have said here, it seems to be a general "pattern" for Agilent exomes.

I'll watch the video. I think it may be worth taking a look at your tool to see if it gives an answer to the bimodal GC content in exomes.

ADD REPLYlink written 4.6 years ago by Folder40g130
4.3 years ago by
ponizvezdochka40 wrote:

Good news everyone! To be honest, we get these strange bimodal GC distributions in every run. I just finished inspecting one human sample: I decided to intersect the reads in my BAM file with exonic and intronic regions downloaded from UCSC, and it fits perfectly.

That's how it looks in FastQC:

fastqc gc

And this is the same GC plot colored according to genomic location - you can see there are two main peaks, for introns and exons respectively:


So here is one more possible explanation of bimodal GC content, but it is library-specific. In our lab we use Agilent Focused Exome. Hope this helps!
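For anyone who wants to try the same kind of breakdown, here is a minimal sketch of the idea in plain Python. It is not the code used for the plot above: the read tuples, the region list and the function names (`gc_percent`, `label_read`, `gc_histograms`) are all made up for illustration, and in practice the reads would come from a BAM file and the regions from the UCSC exon/intron tables.

```python
from collections import Counter, defaultdict

def gc_percent(seq):
    """Percentage of G/C among called A/C/G/T bases (N is ignored)."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called

def label_read(chrom, start, end, regions):
    """Label of the first region overlapping [start, end), else 'other'.

    regions: list of (chrom, start, end, label), 0-based half-open like BED.
    A read spanning two regions gets the label of the one listed first.
    """
    for r_chrom, r_start, r_end, label in regions:
        if chrom == r_chrom and start < r_end and end > r_start:
            return label
    return "other"

def gc_histograms(reads, regions, bin_width=5):
    """Count reads per GC% bin, separately for each region label."""
    hists = defaultdict(Counter)
    for chrom, start, seq in reads:
        label = label_read(chrom, start, start + len(seq), regions)
        gc_bin = int(gc_percent(seq) // bin_width) * bin_width
        hists[label][gc_bin] += 1
    return hists

# Toy data, invented for illustration only.
regions = [("chr1", 100, 200, "exon"), ("chr1", 200, 400, "intron")]
reads = [
    ("chr1", 110, "GCGCGCATGC"),  # exon read
    ("chr1", 150, "GGCCATGCGC"),  # exon read
    ("chr1", 250, "ATATATGCAT"),  # intron read
    ("chr1", 900, "ATGCATGCAT"),  # no overlapping annotation
]
hists = gc_histograms(reads, regions)
for label in sorted(hists):
    print(label, dict(hists[label]))
```

Plotting one histogram per label on the same axes gives a GC plot coloured by genomic location. The linear scan in `label_read` is only reasonable for toy input; for real data an interval index or a dedicated intersection tool would be the better choice.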

ADD COMMENTlink written 4.3 years ago by ponizvezdochka40

While I think that's some great detective work Liu, this may not be the answer for some people - for example, if I do the same analysis as you on some data which does not have a bimodal peak, I also get the same breakdown as you got for exonic/intronic GC%

In other words, yes GC% for intronic and exonic DNA is different, but you should still expect to see a normally distributed GC% plot for unbiased/untargeted sequencing when looking at all the reads together.

But it's still very interesting :)
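To illustrate the point that two classes with different mean GC% do not have to give two peaks, here is a small sketch. It assumes each class is roughly normal with the same spread and the same weight, which real per-read GC distributions only approximate, and the means and standard deviations are invented. Under those assumptions the combined curve is unimodal whenever the means are no more than two standard deviations apart.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and sd sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def count_modes(components, lo=0.0, hi=100.0, step=0.1):
    """Count local maxima of a normal mixture evaluated on a grid.

    components: list of (weight, mean, sd) tuples.
    """
    n = int(round((hi - lo) / step)) + 1
    xs = [lo + i * step for i in range(n)]
    ys = [sum(w * normal_pdf(x, mu, sd) for w, mu, sd in components) for x in xs]
    return sum(1 for i in range(1, n - 1) if ys[i - 1] < ys[i] >= ys[i + 1])

# Hypothetical numbers: means 10 points apart with sd 8 (less than 2 sd apart).
close = [(0.5, 40.0, 8.0), (0.5, 50.0, 8.0)]
# Hypothetical numbers: means 25 points apart with sd 5 (more than 2 sd apart).
far = [(0.5, 35.0, 5.0), (0.5, 60.0, 5.0)]

print("modes when close:", count_modes(close))
print("modes when far:", count_modes(far))
```

So a per-class difference in GC% is compatible with a single-peaked overall plot; two visible peaks need the classes to be well separated relative to their spread, and both to contribute a substantial share of reads.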

ADD REPLYlink written 4.3 years ago by John12k

Absolutely agree, John. In my case the library targets exons, but some reads still map to introns.

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by ponizvezdochka40

Ahh I see - ok awesome :) Well that's very interesting then that you only see a few more reads in exons than introns with that assay. Also, your ggplot GC% graph is so much more detailed (for the intron/exon series) than the FastQC one. I really wish FastQC would stop smoothing their graphs.

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by John12k

I wonder what the plot looks like for the off-target reads that are intergenic

ADD REPLYlink written 4.3 years ago by Daniel Swan13k

Would you mind sharing more details of how you plotted this? I would like to try it out on my samples. Thanks!

ADD REPLYlink written 19 months ago by msimmer92260

Would the curve have multiple peaks if the sample is rRNA depleted? rRNA depleted samples would have several types of ncRNA besides mRNA that might alter the GC content.

ADD REPLYlink written 13 months ago by Arindam Ghosh300
4.6 years ago by
thackl2.8k wrote:

I don't have particular experience with either human or exome sequencing, but I have come across similar distributions in genome sequencing projects. Among others, I have observed it for a highly repetitive plant. In that case, the second peak corresponded to a specific repeat class that was highly abundant in the data set.

Given your mapping result, I concur: contamination is unlikely. So I would try to figure out which locations of the genome these high-GC reads derive from, and whether you can associate them with some useful annotations. Based on your mappings, you could extract regions of the genome with proper read coverage, e.g. with bedtools, and then look for entire sequences or large windows of high GC.
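The "large windows of high GC" step could look roughly like the sketch below. It assumes the covered region has already been extracted as a plain sequence string; the function name `high_gc_windows`, the window size, the step and the 60% threshold are arbitrary choices for illustration, not recommended values.

```python
def high_gc_windows(seq, window=100, step=50, threshold=60.0):
    """Return (start, end, gc_percent) for windows with GC% >= threshold.

    Coordinates are 0-based, half-open, relative to the start of seq.
    """
    seq = seq.upper()
    if not seq:
        return []
    hits = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        gc = 100.0 * (chunk.count("G") + chunk.count("C")) / len(chunk)
        if gc >= threshold:
            hits.append((start, start + len(chunk), gc))
    return hits

# Toy sequence, invented for illustration: AT-rich flanks around a GC-rich core.
toy = "AT" * 50 + "GC" * 50 + "AT" * 50
for start, end, gc in high_gc_windows(toy):
    print(start, end, round(gc, 1))
```

The reported windows can then be compared against repeat or gene annotations to see whether the high-GC reads cluster in a particular class of sequence.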

ADD COMMENTlink written 4.6 years ago by thackl2.8k

Hello, dear thackl. I was running a de novo RNA-seq experiment on a plant. Similarly, my FastQC GC content result is bimodal. Could you explain a bit more what you mean by "the second peak corresponded to specific repeat class"? I think it depends on the presence of the chloroplast genome; what do you think? Best regards

ADD REPLYlink written 3.0 years ago by eyonesi40
2.6 years ago by
raulAlc10 wrote:

Hi! I recently stumbled upon this nice little example of a bimodal distribution of GC content in a WG-Seq run of orange. We suspected possible contamination. Upon blasting some of the reads with high %GC, I got hits that looked like "C.limon DNA for clsat_9 satellite" (satellite DNA). Looking at the citation ( ), I could corroborate that Citrus genomes are rich in satellite DNA, which has a GC content between 60% and 68%. So that explained our secondary peak. Cool!
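Pulling out the high-GC reads for BLAST can be done with a few lines of Python. This is only a sketch of one way to do it, assuming plain 4-line FASTQ records; the function name `high_gc_reads`, the toy reads and the 60% cutoff (taken from the lower end of the satellite GC range above) are illustrative.

```python
import io

def high_gc_reads(fastq_handle, min_gc=60.0):
    """Return (read_id, sequence, gc_percent) for reads with GC% >= min_gc.

    Assumes 4-line FASTQ records (sequence and quality not wrapped).
    """
    selected = []
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        seq = fastq_handle.readline().strip().upper()
        fastq_handle.readline()  # '+' separator line
        fastq_handle.readline()  # quality line
        if not seq:
            continue  # skip blank or truncated records
        gc = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
        if gc >= min_gc:
            read_id = header[1:].strip().split(" ")[0]
            selected.append((read_id, seq, gc))
    return selected

# Toy FASTQ, invented for illustration only.
toy_fastq = io.StringIO(
    "@read1\nGCGCGGCCGA\n+\nIIIIIIIIII\n"
    "@read2\nATATTTAAGC\n+\nIIIIIIIIII\n"
    "@read3 extra\nGGCCGCATAT\n+\nIIIIIIIIII\n"
)
picked = high_gc_reads(toy_fastq)
# FASTA text that could be pasted into a BLAST search.
fasta = "".join(f">{rid} gc={gc:.0f}\n{seq}\n" for rid, seq, gc in picked)
print(fasta, end="")
```

For a real file, the same function can be given an open file handle (or a `gzip.open(..., "rt")` handle for compressed input), and a random subset of the selected reads is usually enough to see what the secondary peak is made of.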

GC content in orange

ADD COMMENTlink written 2.6 years ago by raulAlc10
3.0 years ago by
thackl2.8k wrote:

I don't think that you can necessarily extend the observations made above directly to RNA-seq experiments. Also, I don't really know if a bimodal GC distribution is something to be concerned about in the first place when looking at RNA-seq. You might need to talk to people more involved with RNA-seq. Sorry.

ADD COMMENTlink written 3.0 years ago by thackl2.8k