Question: Why does a ChIP-seq cross-correlation plot have a fragment-length peak and a read-length peak?
16 months ago by
harrypotterandsbt70 wrote:

Hello. I am new to ChIP-seq data analysis, and I have run into a problem while studying ChIP-seq quality assessment. I can't figure out why the cross-correlation plot has two peaks, which are called the fragment-length cross-correlation peak and the read-length peak. What is the difference between fragment length and read length? Are the names meant literally? And why can we assess the signal-to-noise ratio using the normalized ratio between the fragment-length cross-correlation peak and the background cross-correlation? I hope someone can help if you know the answer, or point me to some instructional material. Thank you!


I understand the x-axis position of the fragment-length peak: it is the number of bases X by which the Watson-strand reads are shifted to reach the maximum Pearson correlation coefficient. But I really don't understand what the fragment-length peak and the read-length peak mean, or what they have to do with the Pearson correlation coefficient.

Comment written 16 months ago by harrypotterandsbt
16 months ago by
colin.kern930
United States
colin.kern930 wrote:

When you do a ChIP-seq assay, you are isolating fragments of DNA associated with a TF binding site or histone modification, which we'll call the "point of interest". For every DNA fragment, the size of the fragment and the exact location of the point of interest on that fragment will vary, but for standard ChIP-seq protocols you generally make libraries with fragment sizes of 200-400 bp. Now, when you sequence these fragments, what you get are reads representing one of the ends of the fragment (assuming single-end sequencing). These are going to be something like 50 or 100 bp long, depending on the kind of sequencing you ordered. So the reads are not the sequences of the entire fragments. All the reads coming from one end of the fragments will align to the same strand of the genome, and all the reads coming from the other end will align to the opposite strand. So what you'll get is a peak of reads on the positive strand and a peak of reads on the negative strand, and these peaks will have a gap between them roughly the size of the average fragment length of your library.
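To make the geometry concrete, here is a small sketch of where the two possible single-end reads from one fragment would align. The function name `read_geometry` and the numbers (a 200 bp fragment, 50 bp reads, 0-based inclusive coordinates) are assumptions chosen for illustration, not values from any particular pipeline. In a single-end experiment each fragment gives only one of the two reads; both are shown to compare the distances.

```python
def read_geometry(frag_start, frag_len, read_len):
    """Positions (0-based, inclusive) of the two possible single-end reads from one fragment."""
    frag_end = frag_start + frag_len - 1
    plus_read = (frag_start, frag_start + read_len - 1)   # forward-strand read, 5' end on the left
    minus_read = (frag_end - read_len + 1, frag_end)      # reverse-strand read, 5' end on the right
    return {
        "plus_read": plus_read,
        "minus_read": minus_read,
        # distance between the 5' ends of the two reads: about one fragment length
        "five_prime_offset": minus_read[1] - plus_read[0],
        # distance between the leftmost aligned bases of the two reads
        "leftmost_offset": minus_read[0] - plus_read[0],
        # bases in the middle of the fragment that neither read covers
        "inner_gap": minus_read[0] - plus_read[1] - 1,
    }

geometry = read_geometry(frag_start=1000, frag_len=200, read_len=50)
for name, value in geometry.items():
    print(name, value)
```

With these numbers the 5' ends are 199 bp apart (the fragment length minus one), the leftmost aligned bases are 150 bp apart (fragment length minus read length), and the unsequenced stretch in the middle is 100 bp (fragment length minus twice the read length). Which of these a "gap" refers to depends on which positions are counted; if the 5' ends of the reads are counted, the strand peaks end up about one fragment length apart.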

What the cross-correlation does is measure the Pearson correlation of the read depth between the positive and negative strands at each position. Because the two peaks are separated by the fragment length, the Pearson correlation will initially be relatively low. It then iteratively shifts the location of all the reads on one strand toward the other strand's peak, moving the peaks closer and closer together, and calculates the Pearson correlation at each step. As the peaks line up more and more, the Pearson correlation increases until it reaches a maximum when the peaks are lined up, and the shift at which that happens is, in theory, the average fragment length. This is the graph you see in the cross-correlation plot. The x-axis is how much the reads have been shifted, and the y-axis is the Pearson correlation between the read depth of the two strands.
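The shifting procedure described above can be sketched on toy data. Everything here is an assumption made for illustration: the simulated genome, the evenly spaced binding sites, the fixed 150 bp fragment length, and the helper names (`pearson`, `simulate_five_prime_counts`, `strand_cross_correlation`). It counts read 5' ends per position and is not the implementation used by any specific QC tool.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def simulate_five_prime_counts(genome_len=10000, frag_len=150, frags_per_site=1000, seed=1):
    """Toy data: per-position counts of read 5' ends on each strand."""
    rng = random.Random(seed)
    plus = [0] * genome_len
    minus = [0] * genome_len
    for site in range(500, genome_len, 1000):       # evenly spaced binding sites
        for _ in range(frags_per_site):
            start = site - rng.randrange(frag_len)  # the fragment always covers the site
            end = start + frag_len - 1
            if rng.random() < 0.5:
                plus[start] += 1                    # forward read: 5' end at the fragment's left end
            else:
                minus[end] += 1                     # reverse read: 5' end at the fragment's right end
    return plus, minus

def strand_cross_correlation(plus, minus, max_shift=300):
    """Correlation between plus[i] and minus[i + shift] for each shift from 0 to max_shift."""
    n = len(plus)
    return [pearson(plus[: n - shift], minus[shift:]) for shift in range(max_shift + 1)]

plus, minus = simulate_five_prime_counts()
profile = strand_cross_correlation(plus, minus)
best_shift = max(range(len(profile)), key=profile.__getitem__)
print("shift with the highest correlation:", best_shift)
```

In this toy setup the best shift should land close to the simulated fragment length, and plotting `profile` against the shift gives the same kind of curve as the cross-correlation plot, minus the read-length peak, since no mappability effect is simulated here.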

The peak that appears when the reads are shifted by the read length, sometimes referred to as the phantom peak, is the result of specific locations on the genome having higher or lower mappability than others. That creates a phantom peak that doesn't tell you anything about your data or its quality. It's just an unavoidable result of the process of aligning reads to a genome.
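One way to see why mappability alone produces a peak at the read length is a toy model with no binding sites at all. The assumptions here are mine: each read-length window of the genome is independently either uniquely alignable or not, background reads are scattered only over alignable windows, and a 36 bp read length is used. A forward read from a window has its 5' end at the window's left edge, and a reverse read from the same window has its 5' end at the right edge, so the two strands agree once one is shifted by about one read length.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def simulate_background(genome_len=10000, read_len=36, n_reads=20000, seed=2):
    """Toy background-only data: reads fall only in 'uniquely mappable' windows."""
    rng = random.Random(seed)
    # Hypothetical mappability mask: each window start is alignable with probability 0.5.
    mappable_starts = [p for p in range(genome_len - read_len + 1) if rng.random() < 0.5]
    plus = [0] * genome_len
    minus = [0] * genome_len
    for _ in range(n_reads):
        left = rng.choice(mappable_starts)     # leftmost base of an alignable window
        if rng.random() < 0.5:
            plus[left] += 1                    # forward read: 5' end at the window's left edge
        else:
            minus[left + read_len - 1] += 1    # reverse read: 5' end at the window's right edge
    return plus, minus

bg_plus, bg_minus = simulate_background()
n = len(bg_plus)
bg_profile = [pearson(bg_plus[: n - s], bg_minus[s:]) for s in range(101)]
phantom_shift = max(range(len(bg_profile)), key=bg_profile.__getitem__)
print("shift with the highest correlation:", phantom_shift)
```

Even though nothing in this simulation resembles enrichment, the correlation profile has a single sharp peak at the read length minus one. Real mappability is not independent from window to window, so this only illustrates the mechanism.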

If your data is very noisy, then the correlation gained when your reads are shifted by the fragment length will not be as high. More of your reads will be background, so the peaks of reads themselves won't be as high, and because you have lots of background reads, you'll have a higher baseline Pearson correlation between the strands. Comparing the height of the fragment-length peak to that baseline is therefore a reasonable way to measure the signal-to-noise ratio of your data.
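As a rough sketch of how that comparison is usually turned into numbers: the normalized strand cross-correlation coefficient (NSC) is the fragment-length peak divided by the minimum (background) cross-correlation, and the relative strand cross-correlation coefficient (RSC) additionally uses the read-length peak. The profile values below are made up for illustration, and the function names are my own.

```python
def nsc(profile, frag_shift):
    """Fragment-length correlation divided by the background (minimum) correlation."""
    background = min(profile.values())
    return profile[frag_shift] / background

def rsc(profile, frag_shift, read_shift):
    """Fragment-length peak relative to the read-length (phantom) peak, both above background."""
    background = min(profile.values())
    return (profile[frag_shift] - background) / (profile[read_shift] - background)

# Hypothetical cross-correlation profile: shift in bp -> Pearson correlation.
example_profile = {0: 0.020, 50: 0.030, 100: 0.026, 200: 0.040, 300: 0.024, 400: 0.020}

nsc_value = nsc(example_profile, frag_shift=200)
rsc_value = rsc(example_profile, frag_shift=200, read_shift=50)
print("NSC:", round(nsc_value, 3))
print("RSC:", round(rsc_value, 3))
```

In this made-up profile both ratios come out at about 2. A noisier dataset would have a lower fragment-length peak relative to the background, pushing NSC toward 1.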

15 months ago by
harrypotterandsbt70 wrote:

Thank you, colin.kern! I now understand the difference between library fragments and sequencing reads. But I don't understand why "these peaks will have a gap between them roughly the size of the average fragment length of your library". Do you mean that the gap is 200-400 bp long? I thought the gap would be the fragment length minus twice the read length.
