I have a human exome (Agilent V5) data set where the GC content plot shows two separate peaks. The taller peak is at 42% GC, and a smaller peak is at 58% GC (this one is gradual rather than sharp). In my previous experience, exome data shows a single peak at about 50% GC. I have pasted the weird GC content curve below.
I mapped to the human genome and the mapping rate looks great (>98%). However, the mismatch rate (PF_MISMATCH_RATE generated by Picard) for this data is higher than what I have seen for previous exome data. The weird GC content pattern is also seen in the mapped (and in-target) reads.
So I am wondering what happened here?
An online search suggests that this could be due to contamination:
- I tried mapping against the mouse genome, which gave a mapping rate of 2.5-3.5%, so I have eliminated mouse as a contaminant.
- I collected a subset of sequences with high GC (50-60%). Then I arbitrarily chose 10 from among those and BLASTed them against all of GenBank. The sequences matched human, gorilla, and chimpanzee, so basically the genomes that are most similar to human and not a contaminant species.
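For anyone who wants to repeat the high-GC subsetting step, here is a minimal Python sketch of roughly what I did. The function names (`gc_fraction`, `read_fastq`, `sample_high_gc`) and the file name in the usage note are just illustrative, and it assumes plain, uncompressed FASTQ with four lines per record.

```python
import random


def gc_fraction(seq):
    """Fraction of G/C among the A/C/G/T bases of a read (0.0 if there are none)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # N and other ambiguous bases are ignored
    return gc / total if total else 0.0


def read_fastq(lines):
    """Yield (name, sequence) pairs from an iterable of 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip stray blank lines, e.g. at the end of the file
        seq = next(it).strip()
        next(it)  # '+' separator line
        next(it)  # quality string (not needed here)
        yield header.strip()[1:], seq


def sample_high_gc(records, lo=0.50, hi=0.60, n=10, seed=0):
    """Keep reads whose GC fraction lies in [lo, hi], then pick up to n at random."""
    hits = [(name, seq) for name, seq in records if lo <= gc_fraction(seq) <= hi]
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(hits, min(n, len(hits)))
```

Usage would look something like this (`reads.fastq` is a placeholder name); the printed FASTA records can then be pasted into BLAST:

```python
with open("reads.fastq") as fh:
    for name, seq in sample_high_gc(read_fastq(fh)):
        print(f">{name}\n{seq}")
```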
How else can I pinpoint a contamination source? Any help is appreciated. Thanks!