Hi, I am wondering if anybody can share some experience with the GC% bias in the reads of a ChIP-seq or INPUT DNA sample. Thanks.
It depends on the sequencing technology used, but in all cases I've looked at (using Illumina Genome Analyzers), there's definitely been a positive correlation between GC content and ChIP-seq read density. This has been noted especially when looking at "negative" controls: using an antibody that should not bind anything, you still see a clear enrichment for nucleosome-occupied regions. Someone even turned this into a method called Sono-Seq.
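One rough way to check for that correlation on your own data is to bin the genome, count read starts per bin, and correlate the counts with the GC fraction of each bin. The following is only a minimal sketch: the function names, the plain-string genome and the list of read start positions are all assumptions made for illustration, not part of any particular tool.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return sxy / math.sqrt(sxx * syy)

def gc_density_correlation(genome, read_starts, bin_size):
    """Correlate per-bin GC fraction with per-bin read-start counts.

    genome      -- hypothetical input: one chromosome as a string
    read_starts -- hypothetical input: 0-based start positions of mapped reads
    """
    n_bins = len(genome) // bin_size  # a trailing partial bin is ignored
    counts = [0] * n_bins
    for start in read_starts:
        b = start // bin_size
        if 0 <= b < n_bins:
            counts[b] += 1
    gc = []
    for i in range(n_bins):
        window = genome[i * bin_size:(i + 1) * bin_size].upper()
        gc.append((window.count("G") + window.count("C")) / bin_size)
    return pearson(gc, counts)
```

A clearly positive value in an input or "negative" control sample would be consistent with the bias described above. On real data you would want much larger bins and to mask unmappable regions first.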
First, I found the HOMER tools to be a reasonably good way of looking at GC bias across samples (see the "Sequence Bias" section of http://biowhat.ucsd.edu/homer/chipseq/qc.html). I've seen the curves be non-linear across the read length (see Dr. Chris Benner's example of a biased sample), and this sometimes correlates, informally, with a higher percentage of clonal reads. Plenty of people have just sequenced the adapter 8 million times; those plots look more skewed than the ones with real data...
But for me the odd observation has been that the baselines differ between treatments. I define the baseline as the GC% "far upstream or downstream of the mapped read", which should in theory be the global GC% of the genome. I've seen the baseline GC% vary across samples, usually between different treatments.
The ChIP and Seq parts usually include one or more PCR steps for probe enrichment, so you have an under-representation of GC-rich and AT-rich reads anyway. If you are interested in those reads, you could run several experiments with different temperatures for the PCR steps. But I've never seen such experiments.
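To make the "baseline" idea concrete, here is a minimal sketch of a GC profile around read starts, compared against the overall GC fraction of the sequence. It is not HOMER's implementation. The names `gc_profile` and `gc_fraction`, the plain-string genome and the restriction to plus-strand reads are assumptions made to keep the example short.

```python
def gc_fraction(seq):
    """GC fraction of a sequence, ignoring anything that is not A, C, G or T."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return float("nan")
    return (seq.count("G") + seq.count("C")) / acgt

def gc_profile(genome, read_starts, offsets):
    """Mean GC fraction at each offset relative to the read 5' ends.

    Assumes all reads are on the plus strand (a simplification).
    Offsets far from 0 approximate the "baseline" described above.
    """
    profile = {}
    for off in offsets:
        gc = total = 0
        for start in read_starts:
            pos = start + off
            if 0 <= pos < len(genome):
                base = genome[pos].upper()
                if base in "ACGT":
                    total += 1
                    gc += base in "GC"
        profile[off] = gc / total if total else float("nan")
    return profile
```

If the values at large offsets differ noticeably from `gc_fraction(genome)`, or differ between samples, that is the baseline shift I mean.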
I've investigated something relatively similar to your question, namely:
(a) What fraction of all the reads in a ChIP-seq experiment contain a low-complexity signature, according to the 'dust' formula implemented in megablast?
For the datasets I've tried, this value ranges from 5-15% for A/T and C/G together, so high C/G is about half of that.
(b) What fraction of all the reads in a ChIP-seq experiment are repeated? Here, "repeated" means that there is significant sequence overlap between the reads so that they can be clustered together, or that they map to multiple repetitive regions in the genome.
For the datasets I've tried, this value ranges from 10-25%.
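Both fractions can be approximated with a few lines of code. The sketch below uses a simplified DUST-style triplet score (not megablast's exact implementation) for (a), and exact sequence identity for (b). Exact identity is much stricter than clustering by overlap or counting multi-mapping reads, so it will tend to underestimate. The function names and the threshold of 2.0 are assumptions chosen for illustration only.

```python
from collections import Counter

def dust_score(seq):
    """Simplified DUST-style score: higher means lower complexity.

    Counts overlapping triplets and sums c*(c-1)/2 over triplet counts c,
    normalised by (number of triplets - 1).
    """
    seq = seq.upper()
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if len(triplets) < 2:
        return 0.0
    counts = Counter(triplets)
    return sum(c * (c - 1) / 2 for c in counts.values()) / (len(triplets) - 1)

def low_complexity_fraction(reads, threshold=2.0):
    """Fraction of reads whose score exceeds the (assumed) threshold."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(dust_score(r) > threshold for r in reads) / len(reads)

def repeated_read_fraction(reads):
    """Fraction of reads whose exact sequence occurs more than once."""
    counts = Counter(r.upper() for r in reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for c in counts.values() if c > 1) / total
```

To split (a) into A/T-rich and C/G-rich reads, you could additionally compute the GC fraction of each flagged read and group by whether it is above or below 0.5.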
You can also intersect the ChIP peaks with the CpG islands data in the UCSC Genome Browser. That will give you a good estimate.
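If you export both the peaks and the CpG islands as BED-style intervals, the intersection itself is simple. Here is a minimal sketch, assuming half-open (chrom, start, end) tuples; the function name and input format are hypothetical. For real data a dedicated tool such as bedtools intersect will be much faster than this linear scan.

```python
from collections import defaultdict

def fraction_overlapping(peaks, islands):
    """Fraction of peaks that overlap at least one CpG island.

    peaks, islands -- iterables of (chrom, start, end), half-open intervals.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in islands:
        by_chrom[chrom].append((start, end))
    peaks = list(peaks)
    if not peaks:
        return 0.0
    hits = 0
    for chrom, start, end in peaks:
        # Two half-open intervals overlap if each starts before the other ends.
        if any(s < end and start < e for s, e in by_chrom.get(chrom, [])):
            hits += 1
    return hits / len(peaks)
```

A high fraction suggests the peak set is dominated by CpG-rich regions, which is worth comparing against the same number for the input sample.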