How To Normalize Two ChIP-Seq Datasets?
12.4 years ago
yatung ▴ 200

As the title says, I am curious how to normalize two ChIP-seq datasets against each other. When we do a quantitative comparison between two ChIP-seq samples, how can we know that the differences between them are due to the different conditions?

I ask because I am currently reading the paper "Epigenetic Regulation of Learning and Memory by Drosophila EHMT/G9a".

The article states:

To compensate for differences in sequencing depth and mapping efficiency among the two ChIP-seq samples, the total number of unique tags of each sample was uniformly equalized relative to the sample with the lowest number of tags (7,043,913 tags), allowing for quantitative comparisons.

I just don't understand how this normalization is done.

chip-seq • 28k views

Hello! Hijacking this post a bit. Since it has been almost 3 years since this was asked, I was wondering whether people still normalize by the total number of reads. I'm having a hard time comparing two ChIP-seq datasets (each normalized by its input). One of the IP libraries is much bigger than the other, and I find that MACS is not calling all the peaks it should. So I guess I have two problems: first, one of the IPs is not showing all the peaks it should; second, how do I compare my two IP libraries if they do not have the same number of reads to start with?

I also posted a question on this forum, just in case, with more details on my pipeline. Thanks!

Comparing two ChIP-seq libraries

Rita


It is not a good idea to post a new question in the answer section of an old question. We're not really a forum where threads go on and on; the value of the site is in having one question followed by answers to that specific question. Moreover, posting here makes the question a lot less visible, and far fewer people will take note of it.

I have moved your answer (which was really a question) to the comment section of the main post.


MAnorm: To circumvent the issue of differences in S/N ratio between samples, we focused on ChIP-enriched regions (peaks), and introduced a novel idea, that ChIP-Seq common peaks could serve as a reference to build the rescaling model for normalization. This approach is based on the empirical assumption that if a chromatin-associated protein has a substantial number of peaks shared in two conditions, the binding at these common regions will tend to be determined by similar mechanisms, and thus should exhibit similar global binding intensities across samples. This idea is further supported by motif analysis that we present.
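
To make the idea concrete, here is a rough sketch of a common-peak rescaling in Python. It assumes the usual M-A formulation (M = log2 fold change, A = mean log2 intensity), fits a straight line to the common peaks, and subtracts that trend from every peak. The ordinary least-squares fit, the pseudocount of 1, and all function names are my own simplifications for illustration; the published tool fits the line with a robust regression, so treat this as a sketch of the concept rather than MAnorm's implementation.

```python
from math import log2


def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def ma_values(c1, c2, pseudo=1):
    """M (log2 fold change) and A (mean log2 intensity) for one peak."""
    m = log2((c1 + pseudo) / (c2 + pseudo))
    a = 0.5 * log2((c1 + pseudo) * (c2 + pseudo))
    return m, a


def normalize_m(common_peaks, all_peaks):
    """Fit M ~ A on the common peaks, then subtract the fitted trend from every peak."""
    ms, as_ = zip(*(ma_values(c1, c2) for c1, c2 in common_peaks))
    intercept, slope = fit_line(as_, ms)
    normalized = []
    for c1, c2 in all_peaks:
        m, a = ma_values(c1, c2)
        normalized.append(m - (intercept + slope * a))
    return normalized


# Toy read counts per peak as (sample1, sample2); sample1 is about twice as deep.
common = [(199, 99), (399, 199), (799, 399)]
candidates = common + [(399, 99)]
norm_m = normalize_m(common, candidates)
print([round(m, 3) for m in norm_m])
```

In this toy case the common peaks end up with a normalized M of about 0, and the last peak keeps a log2 fold change of about 1, which is the part of its difference that the global rescaling does not explain.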

12.4 years ago

Most likely they simply divided the number of reads in the larger sample by the number of reads in the smaller sample. Then they used this factor to divide the read count of each peak in the larger sample.

Imagine that they had 1000 total reads for sample1 and 5000 total reads for sample2. The ratio relative to the smaller number is 5000/1000 = 5, so we expect the peaks in sample2 to contain, on average, five times as many reads as those in sample1. To allow a direct comparison between sample1 and sample2, they divide the read counts for peaks in sample2 by 5.

The reason to scale to the smaller sample is to reduce more data to less rather than boost less data to more.
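
The arithmetic above can be written as a few lines of Python. This is only a sketch of what the paper probably did: the function name and the per-peak counts are made up for illustration, and only the library sizes (1000 and 5000) come from the example.

```python
def scale_to_smallest(peak_counts, library_sizes):
    """Scale per-peak read counts so every sample matches the smallest library.

    peak_counts:   {sample: [reads in peak 1, reads in peak 2, ...]}
    library_sizes: {sample: total number of reads}
    """
    smallest = min(library_sizes.values())
    scaled = {}
    for sample, counts in peak_counts.items():
        factor = library_sizes[sample] / smallest  # 1.0 for the smallest sample
        scaled[sample] = [c / factor for c in counts]
    return scaled


library_sizes = {"sample1": 1000, "sample2": 5000}
peak_counts = {"sample1": [10, 40], "sample2": [50, 100]}  # hypothetical peaks
normalized = scale_to_smallest(peak_counts, library_sizes)
print(normalized)
# {'sample1': [10.0, 40.0], 'sample2': [10.0, 20.0]}
```

After scaling, the first peak looks the same in both samples, while the second one is still twice as strong in sample1.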


Very clear!!!! That really helps a lot!


Would one always have to normalize to total reads first? For example, in a GEO set I downloaded recently, one sample had 65 million spots while the other had 63 million. In this case, is it still necessary to scale by total counts? Thanks!

12.4 years ago

This recent publication suggests a more sophisticated way to normalize ChIP-seq data sets.


Nice. I forgot to mention in my answer that the normalization described in the original poster's question is somewhat simplistic, but it is a good start.


Sorry, I cannot open this link now.

12.4 years ago

I do this often. Check the number of lines in both files (if they are BED), take the smaller of the two library sizes (usually sample vs. control), and randomly remove the extra lines from the bigger dataset; the datasets are then normalized. But one should be cautious: a mock IP (IgG) control often has 30-50% fewer reads than the sample, so when this normalization is done, you lose information. So I think it is better to sequence the control at double the depth; the datasets are then roughly comparable and you don't lose information from your sample. I can also send you normalization code in R if you want; it's very easy.
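
For anyone who wants to try this without the R code, a minimal Python sketch of the random removal step could look like the following. The `downsample` helper and the toy records are hypothetical; with real data you would read the lines of the two BED files into the lists instead.

```python
import random


def downsample(reads, target_size, seed=0):
    """Return a random subset of `reads` with `target_size` entries, order preserved."""
    if target_size >= len(reads):
        return list(reads)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    keep = sorted(rng.sample(range(len(reads)), target_size))
    return [reads[i] for i in keep]


# Toy BED-like records (chrom, start, end) standing in for the two files.
sample = [("chr1", i * 10, i * 10 + 36) for i in range(1000)]
control = [("chr1", i * 7, i * 7 + 36) for i in range(600)]

target = min(len(sample), len(control))
sample_ds = downsample(sample, target)
control_ds = downsample(control, target)
print(len(sample_ds), len(control_ds))
# 600 600
```

Sampling indices and then sorting them keeps the surviving lines in their original order, which matters if the BED file was sorted by coordinate.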

Cheers

10.8 years ago
dnaseiseq ▴ 220

Bailey T, Krajewski P, Ladunga I, Lefebvre C, Li Q, et al. (2013) Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. PLoS Comput Biol 9(11): e1003326. doi:10.1371/journal.pcbi.1003326

http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003326


This is a really nice overview; we will try to find a way to feature these more prominently.

11.2 years ago
shenli.sam ▴ 190

Normalization by sequencing depth (i.e. total read count) is probably the simplest approach and is widely used, but it has drawbacks. For example, if the signal-to-noise ratio is very different between two libraries, one library will contain more background reads than the other. Those background reads are still included in the total read count, which biases the estimated scaling factor.

diffReps takes a better approach to normalization. Low-count regions are first removed from consideration; then a normalization ratio is calculated for each library in each of the remaining regions. Finally, the median of the ratios for each library is used as its normalization factor. This gives a relatively unbiased, robust estimate.
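
Here is a small Python sketch of the median-of-ratios idea, to show why it is robust. It is not diffReps code: the `min_count` threshold, the use of the mean count across libraries as the per-region reference, and the function name are assumptions made for this illustration.

```python
from statistics import mean, median


def median_ratio_factors(counts, min_count=10):
    """counts: {library: [count per region]}, same region order in every library."""
    libraries = list(counts)
    n_regions = len(counts[libraries[0]])
    # Keep only regions where every library has at least `min_count` reads.
    kept = [i for i in range(n_regions)
            if all(counts[lib][i] >= min_count for lib in libraries)]
    # Per-region reference: mean count across libraries (an assumption of this sketch).
    reference = {i: mean(counts[lib][i] for lib in libraries) for i in kept}
    # Normalization factor per library: median of its count/reference ratios.
    return {lib: median(counts[lib][i] / reference[i] for i in kept)
            for lib in libraries}


# Toy counts: libB is about twice as deep; the last region is a true difference.
counts = {
    "libA": [100, 200, 50, 2, 400],
    "libB": [200, 400, 100, 1, 2400],
}
factors = median_ratio_factors(counts)
normalized = {lib: [c / factors[lib] for c in vals] for lib, vals in counts.items()}
print(factors)
```

The low-count region (2 vs. 1 reads) is dropped before the ratios are computed, and the one strongly differential region does not move the medians, so the factors still reflect the twofold depth difference seen in the other regions.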


If the percentage of duplicated reads is very different between the two samples, then the final number of usable reads (e.g. by MACS) will still be quite different.
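
A quick way to see the effect is to count reads before and after collapsing duplicates. The sketch below treats reads with the same chromosome, start, and strand as duplicates and keeps one of each; that rule and the toy reads are assumptions for illustration, and the duplicate handling of a given peak caller may differ.

```python
def usable_reads(reads):
    """Count reads left after collapsing duplicates at the same position and strand."""
    return len(set(reads))


# Toy reads as (chrom, start, strand); both libraries have 8 reads before filtering.
lib1 = [("chr1", 100, "+")] * 5 + [("chr1", 200, "-"), ("chr1", 300, "+"), ("chr2", 50, "+")]
lib2 = [("chr1", pos, "+") for pos in range(100, 900, 100)]

for name, lib in (("lib1", lib1), ("lib2", lib2)):
    print(name, len(lib), usable_reads(lib))
```

Both toy libraries start with the same total, but lib1 keeps only half of its reads, so equalizing the raw totals would not equalize the reads that are actually used.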

12.4 years ago
Frenkiboy ▴ 260

If the difference in the total number of reads in two samples is not too big, sometimes people just randomly downsample the bigger one.
