I've been creating autocorrelation plots for my ChIP-Seq data, and I'm seeing a peak at the read length, just as described here:
... as the quality of samples decrease, the signal from tags mapping as palindromes increases. For example, the input (mock IP) control sample below shows a strong spike at 23 bp, which corresponds to the length of sequence used to map that sample.
Having an "opposite strand" peak at the read length indicates that in many cases pairs of reads are mapping as approximate reverse complements of each other. That is, if a read maps at position X on the plus strand, then it is likely that another read will map at position X + read length on the minus strand. Here is an ASCII art diagram of what is happening:
---------------------> ========================================================== <---------------------
Can anyone explain why many pairs of tags would map as reverse complements of each other? The data is single-end Illumina sequencing. The linked article calls this "the signal from tags mapping as palindromes". I suppose this means that a read could map as its own reverse complement, and that if many reads do so, they will produce this peak. However, I wrote a script to see if any of my reads are identical to their reverse complement, and it didn't find any.
This is not a plot of coverage depth vs genomic coordinate. This is a so-called "autocorrelation plot", which can be described as follows: For every distinct pair of reads in the data set, subtract their starting posisions. Split these differences into three groups: both reads on plus strand, both reads on minus strand, and one of each. Make a histogram of the resulting groups. The plot above shows the first two groups in blue, and the third group in red.
In a good sample, I expect a red peak centered at the footprint size of my protein of interest. For example, I am doing MNase-seq for studying histone modifications, so I expect a peak at around 150 bp, the approximate length of DNA in one nucleosome. I expect this peak to always occur in the same location regardless of the read length that was used during sequencing.
In contrast, I also see a peak in my data around 40 bp, which is exactly the read length used in the sequencing run. This peak occurs even in the control, in which no IP was performed at all and no significant peaks corresponding to nucleosome sizes are visible. Based purely on the fragmentation process and IP pulldown, I cannot imagine how this peak would occur. It seems like it must be an artifact of the sequencing process, or the wet-lab prep before sequencing.
So, to be clear, I understand why there would be a peak at the x-coordinate corresponding to the footprint size of my protein (i.e. fragment length of DNA). That's why I made the plots in the first place. What I don't understand is the other peak that corresponds to the read length.