Hello. I am new to ChIP-seq data analysis, and I ran into a problem while studying ChIP-seq quality assessment. I can't figure out why the cross-correlation plot has two peaks, which are called the fragment-length cross-correlation peak and the read-length peak. What is the difference between fragment length and read length? Do the terms mean what they literally say? And why can we assess the signal-to-noise ratio using the normalized ratio between the fragment-length cross-correlation peak and the background cross-correlation? I hope someone can help me or point me to some instructional material. Thank you!
When you do a ChIP-seq assay, you are isolating fragments of DNA associated with a TF binding site or histone modification, which we'll call the "point of interest". For every DNA fragment, the size of the fragment and the exact location of the point of interest on that fragment will vary, but for standard ChIP-seq protocols you generally make libraries with fragment sizes of 200-400 bp. Now, when you sequence these fragments, what you get are reads representing one of the ends of the fragment (assuming single-end sequencing). These are going to be something like 50 or 100 bp long, depending on the kind of sequencing you ordered. So the reads are not the sequences of the entire fragments. All the reads coming from one end of the fragments will align to the same strand of the genome, and all the reads coming from the other end of the fragments will align to the opposite strand. So what you'll get is a peak of reads on the positive strand and a peak of reads on the negative strand, and these peaks will have a gap between them roughly the size of the average fragment length of your library.
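To make the coordinates concrete, here is a toy sketch with made-up numbers (the function names and the 300 bp / 50 bp values are just for illustration, not from any tool). The distance that matters is between the 5' ends (start positions) of the reads on the two strands, which is about one fragment length. The unsequenced stretch between the two reads is shorter than that.

```python
# Toy example: one fragment, one possible read from each end.
# Coordinates are 0-based and all numbers are hypothetical.

def read_five_prime_ends(frag_start, frag_len):
    """Return the 5' positions of the plus- and minus-strand reads of a fragment."""
    plus_5p = frag_start                  # plus-strand read starts at the left end
    minus_5p = frag_start + frag_len - 1  # minus-strand read starts at the right end
    return plus_5p, minus_5p

def unsequenced_gap(frag_len, read_len):
    """Bases in the middle of the fragment covered by neither read (0 if reads would overlap)."""
    return max(frag_len - 2 * read_len, 0)

plus_5p, minus_5p = read_five_prime_ends(frag_start=1000, frag_len=300)
offset = minus_5p - plus_5p                       # distance between the two 5' ends
gap = unsequenced_gap(frag_len=300, read_len=50)  # stretch between the read bodies
print(offset, gap)
```

The offset between the 5' ends depends only on the fragment length, not on the read length.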
What the cross-correlation does is measure the Pearson correlation of the read depth between the positive and negative strands at each position. With no shift applied, the two peaks are separated by the fragment length, so the Pearson correlation will be relatively low. It then iteratively shifts the location of all the reads on one strand toward the other strand, moving the peaks closer and closer, and calculates the Pearson correlation at each step. As the peaks line up more and more, the Pearson correlation increases until it reaches a maximum when the peaks are lined up, and the shift at that maximum is theoretically the average fragment length. This is the graph you see in the cross-correlation plot. The x-axis is how much the reads have been shifted, and the y-axis is the Pearson correlation between the read depth of the two strands.
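The shifting procedure can be sketched in a few lines of Python with NumPy. This is only an illustration on simulated data, not how any particular QC tool implements it. The function name, the simulated genome, and the simplification that every fragment is exactly 200 bp long are all my assumptions.

```python
import numpy as np

def strand_cross_correlation(plus, minus, max_shift):
    """Pearson correlation between plus-strand counts and minus-strand counts
    moved left by d bases, for d = 0..max_shift."""
    plus = np.asarray(plus, dtype=float)
    minus = np.asarray(minus, dtype=float)
    n = len(plus)
    cc = []
    for d in range(max_shift + 1):
        # Pair position i on the plus strand with position i + d on the minus strand.
        cc.append(np.corrcoef(plus[: n - d], minus[d:])[0, 1])
    return np.array(cc)

# Simulated data: all values are hypothetical.
rng = np.random.default_rng(0)
genome_len, frag_len = 20000, 200   # simplification: every fragment has the same length
sites = rng.integers(1000, genome_len - 1000, size=50)
plus = np.zeros(genome_len)    # counts of plus-strand read 5' ends per base
minus = np.zeros(genome_len)   # counts of minus-strand read 5' ends per base
for s in sites:
    # 30 fragments per site, each placed so that it contains the site.
    starts = s - rng.integers(0, frag_len, size=30)
    np.add.at(plus, starts, 1)
    np.add.at(minus, starts + frag_len - 1, 1)

cc = strand_cross_correlation(plus, minus, max_shift=400)
best_shift = int(np.argmax(cc))   # shift with the highest correlation
print(best_shift)
```

Plotting `cc` against the shift gives the same kind of curve as the cross-correlation plot. With real data the fragment lengths vary, so the peak is broader and lower than in this idealized simulation.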
The peak that appears when the reads are shifted by the read length, which is sometimes referred to as the phantom peak, is the result of there being specific locations in the genome that have higher or lower mappability than other locations. This creates a phantom peak that doesn't tell you anything about your data or its quality. It's just an unavoidable result of the process of aligning reads to a genome.
If your data is very noisy, then the correlation gained when your reads are shifted by the fragment length will not be as high. More of your reads will be background, so the peaks of reads themselves won't be as high, and because you have lots of background reads, you'll have a higher baseline Pearson correlation between the strands. This makes the ratio of the fragment-length peak to that baseline a reasonable way to measure the signal-to-noise ratio of your data.
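As a rough illustration, the normalized ratio can be computed as the cross-correlation at the fragment-length shift divided by the minimum cross-correlation over the scanned shifts (this ratio is often called NSC). The two curves below are invented to mimic a clean and a noisy sample. The function name and all numbers are hypothetical, and the sketch assumes the background value is positive.

```python
import numpy as np

def normalized_strand_coefficient(cc, fragment_shift):
    """Cross-correlation at the fragment-length shift divided by the
    minimum (background) cross-correlation. Assumes the minimum is > 0."""
    cc = np.asarray(cc, dtype=float)
    background = cc.min()
    return cc[fragment_shift] / background

# Hypothetical cross-correlation curves over shifts 0..400.
shifts = np.arange(0, 401)
bump = np.exp(-(((shifts - 200) / 40.0) ** 2))   # peak centred on a 200 bp shift
clean = 0.02 + 0.08 * bump   # low baseline, tall fragment-length peak
noisy = 0.05 + 0.01 * bump   # high baseline, small fragment-length peak

nsc_clean = normalized_strand_coefficient(clean, 200)
nsc_noisy = normalized_strand_coefficient(noisy, 200)
print(round(nsc_clean, 2), round(nsc_noisy, 2))
```

The noisy curve has both a higher baseline and a lower peak, so its ratio sits much closer to 1.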
Thank you, colin.kern! I now understand the difference between library fragments and sequence reads. But I don't understand why " these peaks will have a gap between them roughly the size of the average fragment length of your library ". Do you mean that the gap is 200-400 bp long? I thought the gap would be the fragment length minus twice the read length.