Question

Where Does The Correlation Within And Between Strands In A Dna-Seq Experiment Come From?

3

12.7 years ago

KCC ★ 4.1k

Where does the correlation between strands and within strands in a DNA-seq experiment come from? I have found it's present both in the input and the ChIP sequencing data for ChIP-seq experiments. How do we interpret this correlation?

One detailed explanation I have found so far is located here: http://compbio.med.harvard.edu/wiki/display/pub/Quantifying+strand+asymmetry+with+normalized+cross-correlation+function

It says the following:

"A typical ChIP-seq experiment would show a pronounced peak at shift distance approximately equal to the prevalent size of the DNA fragments coming off the IP. This peak indicates that the DNA fragments tend to be clustered around specific positions. In other sequencing experiments, for instance those measuring DNAase I hypersensitivity, this may not be the case: the end points of the fragments may be clustered within broader regions, however complete DNA fragments would not necessarily show strong tendency to center around specific positions. In such cases, one would expect to see a high degree of read clustering, but low strand asymmetry. A cross-correlation function for such data would look almost symmetric with respect to 0 shift, with tails on both sides comparable to those of auto-correlation function."

Another source claims you can estimate the average peak width and fragment length by computing the auto-correlation within a strand and comparing it to the cross-correlation with the Crick strand. The peak width is taken to be the distance at which the auto-correlation drops to the same value as the intercept on the y-axis: http://biowhat.ucsd.edu/homer/chipseq/qc.html
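To make the shift-based idea from these two sources concrete, here is a minimal sketch on simulated data. It is not the code used by either source. The function names, the fragment length of 150 bp, the jitter, and every other parameter are assumptions chosen for illustration. Fragments are placed around hypothetical binding sites, each fragment contributes one read from a randomly chosen end, and the plus-strand counts are correlated with the minus-strand counts at increasing shifts.

```python
import numpy as np


def simulate_chip(genome_len=200_000, n_sites=200, frags_per_site=50,
                  frag_len=150, jitter=20, seed=0):
    """Toy ChIP-like data: per-base counts of read 5' ends on each strand.

    All parameter values are illustrative assumptions, not taken from any
    real experiment.
    """
    rng = np.random.default_rng(seed)
    plus = np.zeros(genome_len)
    minus = np.zeros(genome_len)
    sites = rng.integers(1000, genome_len - 1000, size=n_sites)
    for site in sites:
        # Fragments are centred near the binding site, with some jitter.
        centers = site + rng.integers(-jitter, jitter + 1, size=frags_per_site)
        starts = centers - frag_len // 2
        ends = starts + frag_len - 1
        # Each fragment is sequenced from one randomly chosen end: the left
        # end gives a plus-strand read, the right end a minus-strand read.
        from_left = rng.random(frags_per_site) < 0.5
        np.add.at(plus, starts[from_left], 1)
        np.add.at(minus, ends[~from_left], 1)
    return plus, minus


def strand_cross_correlation(plus, minus, max_shift):
    """Pearson correlation of plus[x] with minus[x + d] for d = 0..max_shift."""
    n = len(plus)
    return np.array([
        np.corrcoef(plus[: n - d], minus[d:])[0, 1]
        for d in range(max_shift + 1)
    ])


plus, minus = simulate_chip()
cc = strand_cross_correlation(plus, minus, max_shift=300)
print("shift with highest cross-correlation:", int(np.argmax(cc)))
```

In this toy model the plus and minus read starts of a fragment sit about one fragment length apart, so the correlation should be highest at a shift close to the simulated fragment length and close to zero at shift 0.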

Can someone explain more completely why there is cross-correlation between strands and auto-correlation within strands, and what kind of information I can hope to get from this kind of analysis of DNA-seq data?
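For reference, the within-strand versus between-strand comparison can be tried on a toy data set like the one below. This is only a sketch under stated assumptions: the helper names, the cluster width, and the read counts are made up, and reads on both strands are drawn from the same clusters with no strand offset, which is roughly the "read clustering but low strand asymmetry" situation described in the quote above.

```python
import numpy as np


def lagged_correlation(x, y, max_lag):
    """Pearson correlation of x[i] with y[i + lag] for lag = 1..max_lag."""
    n = len(x)
    return np.array([
        np.corrcoef(x[: n - lag], y[lag:])[0, 1]
        for lag in range(1, max_lag + 1)
    ])


def simulate_clustered_reads(genome_len=200_000, n_clusters=300,
                             reads_per_strand=40, half_width=50, seed=1):
    """Toy data: reads on both strands scattered uniformly within clusters.

    There is no offset between strands; all values are assumptions.
    """
    rng = np.random.default_rng(seed)
    plus = np.zeros(genome_len)
    minus = np.zeros(genome_len)
    centers = rng.integers(1000, genome_len - 1000, size=n_clusters)
    for c in centers:
        for strand in (plus, minus):
            pos = c + rng.integers(-half_width, half_width + 1,
                                   size=reads_per_strand)
            np.add.at(strand, pos, 1)
    return plus, minus


plus, minus = simulate_clustered_reads()
auto = lagged_correlation(plus, plus, max_lag=300)    # within one strand
cross = lagged_correlation(plus, minus, max_lag=300)  # between strands
for lag in (10, 50, 100, 200):
    print(lag, round(auto[lag - 1], 3), round(cross[lag - 1], 3))
```

In this setup both curves should be positive at short lags and fall off once the lag exceeds the cluster width, so the decay distance reflects how wide the clusters are rather than how long the fragments are.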

correlation • 5.7k views

updated 10.9 years ago by Biostar 20 • written 12.7 years ago by KCC ★ 4.1k

score 2 · Answer 1 · 2012-11-08

Here is a nice discussion on this - it is not easy to find but I happen to remember it so I looked it up.

In a nutshell, the reason seems to be that a region that is unique in the genome is unique on both strands, so reads from both strands map preferentially to the same regions, and that biases the mapping.

Why does ChIP-Seq data have lots of reverse-complement pairs of reads?
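A toy model of the mapping argument, in case it helps. This is not taken from the linked discussion. It assumes a hypothetical mappability mask in 500 bp blocks and uniform background reads with made-up rates. Reads are drawn independently on each strand, and then only reads in "mappable" positions are kept, with the same mask applied to both strands.

```python
import numpy as np

rng = np.random.default_rng(2)
genome_len, block = 200_000, 500

# Hypothetical mappability mask: each 500 bp block is either uniquely
# mappable on both strands or not mappable at all (an assumption).
mappable = np.repeat(rng.random(genome_len // block) < 0.5, block)

# Uniform background reads, drawn independently for each strand.
plus_raw = rng.poisson(0.2, size=genome_len)
minus_raw = rng.poisson(0.2, size=genome_len)

# Only reads at mappable positions survive alignment; the same mask
# applies to both strands.
plus_mapped = plus_raw * mappable
minus_mapped = minus_raw * mappable

r_raw = np.corrcoef(plus_raw, minus_raw)[0, 1]
r_mapped = np.corrcoef(plus_mapped, minus_mapped)[0, 1]
print("strand correlation before filtering:", round(r_raw, 3))
print("strand correlation after filtering: ", round(r_mapped, 3))
```

Before filtering, the two strands are independent, so their correlation should be near zero. After filtering, the shared mask alone produces a positive correlation, even though nothing is enriched. Real mappability is much less regular than this, so treat it as an illustration of the mechanism only.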