Question: Identifying contaminant sources
2.3 years ago by
rachanaj0 wrote:


I have a human exome (Agilent V5) data set whose GC content plot shows two separate peaks: a higher peak at 42% GC and a smaller peak at 58% GC (the smaller one is gradual rather than sharp). In my previous experience, exome data shows a single peak at about 50% GC. I have pasted the weird GC content curve below.

I mapped the reads to the human genome and the mapping rate looks great (>98%). However, the mismatch rate (PF_MISMATCH_RATE generated by Picard) is higher than what I have seen for previous exome data. The weird GC content pattern is also seen in the mapped (and in-target) reads.

So I am wondering what happened here?

An online search suggests that this could be due to contamination:
- I tried mapping against the mouse genome, which gave a mapping rate of only 2.5-3.5%, so I have eliminated mouse as a contaminant.
- I collected a subset of sequences with high GC (50-60%), then arbitrarily chose 10 of them and BLASTed them against the entire GenBank. The sequences matched human, gorilla, and other genomes closely related to human, not contaminant species.
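For anyone who wants to reproduce the high-GC subsetting step described above, here is a minimal Python sketch. It assumes an uncompressed four-line-per-record FASTQ; the function names (`gc_fraction`, `read_fastq`, `sample_high_gc`) and the file name in the usage note are made up for illustration and are not from the original post.

```python
import random


def gc_fraction(seq):
    """Fraction of G/C among unambiguous A/C/G/T bases (0.0 if there are none)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def read_fastq(path):
    """Yield (name, sequence) from an uncompressed 4-line-per-record FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline()
            if not header:
                break
            seq = handle.readline().strip()
            handle.readline()  # '+' separator line
            handle.readline()  # quality line
            yield header.strip()[1:], seq


def sample_high_gc(records, low=0.50, high=0.60, n=10, seed=0):
    """Pick up to n random reads whose GC fraction lies in [low, high]."""
    hits = [(name, seq) for name, seq in records
            if low <= gc_fraction(seq) <= high]
    rng = random.Random(seed)  # fixed seed so the same reads are picked each run
    return rng.sample(hits, min(n, len(hits)))
```

Something like `sample_high_gc(read_fastq("reads.fastq"))` (hypothetical file name) would return the reads to paste into BLAST. Ten reads is a very small sample, so raising `n` may give a clearer picture of what the high-GC reads are.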

How else can I pinpoint a contamination source? Any help is appreciated. Thanks!


gcbias exome • 859 views
modified 2.3 years ago • written 2.3 years ago by rachanaj0

The ribosomal RNA genes have a biased GC content; have you checked to see if those are the cause?

Follow-up: the Alu repeats also contain GC-rich sequences derived from 7SL RNA. You can check for over-represented k-mers in your data to see if there's a match.

modified 2.3 years ago • written 2.3 years ago by harold.smith.tarheel4.3k

Hi @harold.smith.tarheel

I have never dealt with rRNA/rDNA data, so I am trying to understand your comment and figure out what I can do. Please let me know if I am misunderstanding it. You are saying that rRNA genes may be over-represented in the reads, which would skew the GC content plot. So if I zoom in on the rRNA genes, I should see higher-than-average coverage in my data.

One easy way of doing this is to look for over-represented kmers in my data to see if they align to the 7SL RNA sequence.


written 2.3 years ago by rachanaj0

Total RNA is typically >95% rRNA. If your exome enrichment was not very efficient, then the presence of abundant rRNA (which is GC-rich) in your library would produce a secondary peak. Likewise, contamination with the highly abundant Alu repeats would do the same.

Note that these 'contaminants' are actually derived from your genome of interest and would be predicted to map. But your ability to visualize them in a viewer is dependent upon the alignment software: some tools assign repeat sequences to a single locus, some assign them randomly to the different repeats, and some simply flag them as multi-mappers without assignment to any locus. A further confounder is the reference genome: some versions mask the repeat sequences.

However, repetitive sequences like Alu elements should be highly over-represented in your data and therefore detectable by a k-mer counter.
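To make the k-mer idea concrete, here is a small Python sketch of a naive k-mer counter. The function names and the default `k=12` are arbitrary choices for illustration, not something specified in this thread. It holds all counts in memory, so it is only suitable for a subsample of reads, and it does not collapse a k-mer with its reverse complement.

```python
from collections import Counter


def count_kmers(seqs, k=12):
    """Count every k-mer across the given sequences, skipping k-mers that contain N."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def top_kmers(seqs, k=12, n=20):
    """Return the n most frequent k-mers as (kmer, count) pairs."""
    return count_kmers(seqs, k).most_common(n)
```

The top k-mers from the high-GC reads could then be searched against known repeat or rRNA sequences (for example the 7SL-derived part of Alu) to see whether they match.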

modified 2.3 years ago • written 2.3 years ago by harold.smith.tarheel4.3k

You may need to align your data to the human rDNA repeat independently (it is likely that the genome you used does not have a copy of that repeat).

That said, you have reported 98% alignment for your data, so whatever is in your data is also in the reference you used.

modified 2.3 years ago • written 2.3 years ago by genomax62k

You might want to take a look at this: Bimodal GC content

written 2.3 years ago by Daniel Swan13k

Have you tried to align the data to the human genome (that is not clear from the post above)? It is possible that perhaps you are worried for no reason and the data may align as normal.

If you are certain there may be contamination then you could use BBSplit from BBMap to bin your reads and then look at the "non-human" pool of reads closely.

modified 2.3 years ago • written 2.3 years ago by genomax62k

I am sorry, I should have been clearer. I have already mapped the reads to the human genome, and the mapping rate is great (>98%). The only thing that looked suspicious in the mapping statistics was the higher mismatch rate. I then looked at the GC content of the mapped reads, and I still see the weird GC bias. I further looked at the mapped reads that fall in the target region, hoping that the weird-GC reads would fall in the non-targeted region, but no: the mapped reads within the targeted region also have the weird GC.

So in short, I cannot just look at the unmapped reads to find the contamination source. The contamination seems to be within the mapped reads as well.
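The comparison described here (GC distribution of one read set versus another, such as in-target versus off-target) can be sketched in Python as below. This is only an illustration: it assumes the read sequences of each group have already been extracted into lists, and the names `gc_percent` and `gc_histogram` and the 5% bin width are hypothetical choices.

```python
from collections import Counter


def gc_percent(seq):
    """GC percentage over unambiguous A/C/G/T bases (0.0 if there are none)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / total if total else 0.0


def gc_histogram(seqs, bin_width=5):
    """Map the lower edge of each GC% bin to the fraction of reads falling in it."""
    counts = Counter()
    n = 0
    for seq in seqs:
        pct = gc_percent(seq)
        # Reads at exactly 100% GC are placed in the last bin.
        bin_start = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        counts[bin_start] += 1
        n += 1
    return {b: counts[b] / n for b in sorted(counts)} if n else {}
```

Running `gc_histogram` on each group and comparing the bins around 42% and 58% would show whether the second peak is equally strong in both sets.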

written 2.3 years ago by rachanaj0

If you don't have reads mapping off-target and the reads are otherwise mapping normally, then this may just be a one-time artifact of how this particular library was made (fragmentation method used, PCR bias; I am just speculating here).

A suggestion: Instead of sharing a link from google drive you may want to post an image (choose one of the free hosting providers) and then share the link in your original post to show the GC "contamination".

modified 2.3 years ago • written 2.3 years ago by genomax62k

Yes @genomax2, that is also a possibility. I am working with the library prep team to find out if something happened there. Meanwhile, I was also wondering whether I can do something bioinformatically. Thanks for thinking about this.

written 2.3 years ago by rachanaj0