Question: How To Do Normalization Of Two ChIP-Seq Data?
yatung (200) wrote, 8.2 years ago:

As the title says, I am curious how to do normalization between two ChIP-seq datasets. When doing a quantitative comparison between two ChIP-seq samples, how can we know that the differences between them are due to the different conditions?

The reason I ask is that I am currently reading a paper:

"Epigenetic Regulation of Learning and Memory by Drosophila EHMT/G9a "

The article mentions that "To compensate for differences in sequencing depth and mapping efficiency among the two ChIP-seq samples, the total number of unique tags of each sample was uniformly equalized relative to the sample with the lowest number of tags (7,043,913 tags), allowing for quantitative comparisons." I just don't understand how this normalization is done.

bioinformatics chip-seq • 20k views
modified 5.6 years ago by ritarebollo • written 8.2 years ago by yatung

Hello! Hijacking this post a bit. Since it has been almost 3 years since this was asked, I was wondering if people still do this total-number-of-reads normalization? I'm having a hard time comparing two ChIP-seq datasets (each normalized by its input). One of the IP libraries is much bigger than the other, and I find that MACS is not showing all the peaks it should. So I guess I have two problems: first, one of the IPs is not showing all the peaks it should; second, how do I compare my two IP libraries if they do not have the same number of reads to start with?

I also posted a separate question in this forum, with more details on my pipeline. Thanks!

A: Comparing two ChIP-seq libraries


modified 5.6 years ago by Istvan Albert • written 5.6 years ago by ritarebollo

It is not a good idea to post a new question in the answer section of an old question. We're not really a forum where threads go on and on; the value of the site is in having one question with answers specific to that question. Moreover, posting here makes the question much less visible, and far fewer people will take note of it.

I have moved your answer (which was really a question) to the comment section of the main post.

modified 5.6 years ago • written 5.6 years ago by Istvan Albert

MAnorm: To circumvent the issue of differences in S/N ratio between samples, we focused on ChIP-enriched regions (peaks), and introduced a novel idea, that ChIP-Seq common peaks could serve as a reference to build the rescaling model for normalization. This approach is based on the empirical assumption that if a chromatin-associated protein has a substantial number of peaks shared in two conditions, the binding at these common regions will tend to be determined by similar mechanisms, and thus should exhibit similar global binding intensities across samples. This idea is further supported by motif analysis that we present.
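The core MAnorm idea, deriving a rescaling factor from peaks shared between the two conditions, can be illustrated with a toy Python sketch. Note that MAnorm itself fits a robust linear model on M-A values; this sketch only captures the simplest case of a single global shift, and the function name and counts are made up for illustration:

```python
import math
import statistics

def common_peak_scale(common_a, common_b):
    """Toy version of the MAnorm idea: assume common peaks should show
    equal binding intensity in both samples, and derive a global
    rescaling factor from the median log2 ratio at those peaks."""
    log_ratios = [math.log2(b / a) for a, b in zip(common_a, common_b)]
    shift = statistics.median(log_ratios)  # robust to outlier peaks
    return 2 ** shift                      # divide sample-b counts by this

# read counts at peaks called in BOTH samples
a = [100, 40, 250]
b = [210, 78, 520]
factor = common_peak_scale(a, b)
```

Dividing the sample-b counts by `factor` puts the common peaks on roughly the same scale, after which condition-specific peaks can be compared quantitatively.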

written 6 months ago by Shicheng Guo
Istvan Albert (University Park, USA) wrote, 8.2 years ago:

Most likely they simply divided the number of reads in the larger sample by the number of reads in the smaller sample, then used this factor to divide the read count of each peak in the larger sample.

Imagine they had 1,000 total reads for sample1 and 5,000 total reads for sample2. The ratio relative to the smaller number is 5000/1000 = 5, so we expect peaks in sample2 to contain, on average, five times as many reads as those in sample1. To allow a direct comparison between sample1 and sample2, they divide the read counts for peaks in sample2 by 5.

The reason to scale down to the smaller sample is that it is safer to reduce more data to less than to inflate less data into more.
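The arithmetic above is simple enough to sketch in a few lines of Python (the function name and peak counts are made up for illustration):

```python
def scale_to_smaller(total_small, total_large, peak_counts_large):
    """Divide each peak's read count in the larger sample by the
    ratio of total reads (larger / smaller), so the two samples
    become directly comparable."""
    factor = total_large / total_small
    return [count / factor for count in peak_counts_large]

# sample1: 1000 total reads, sample2: 5000 total reads -> factor = 5
normalized = scale_to_smaller(1000, 5000, [50, 250, 500])
print(normalized)  # -> [10.0, 50.0, 100.0]
```

After this scaling, a peak with 250 reads in sample2 is counted as 50, on the same footing as a 50-read peak in sample1.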

written 8.2 years ago by Istvan Albert

Very clear! That really helps a lot!

written 8.2 years ago by yatung

Would one have to first normalize to total reads? For example, in one GEO dataset I downloaded recently, one sample had 65 million spots while the other had 63 million. In this case, is it necessary to scale by total counts first? Thanks!

written 4 weeks ago by simplitia
Mikael Huss wrote, 8.2 years ago:

This recent publication suggests a more sophisticated way to normalize ChIP-seq data sets.

written 8.2 years ago by Mikael Huss

Nice. I forgot to mention in my answer that the normalization described in the original poster's question is somewhat simplistic, but a good start.

written 8.2 years ago by Istvan Albert
Sukhdeep Singh wrote, 8.2 years ago:

I do this often: check the number of lines in both files (if they are BED), note the smaller library size (usually sample vs. control), and then randomly remove the extra lines from the bigger dataset. This normalizes the datasets to the same depth. But be cautious: a mock-IP (IgG) control often has 30-50% fewer reads than the sample, so when you downsample to match it, you lose information. What I think is better is to sequence double the amount of control; then the datasets are roughly comparable and you don't lose information from your sample. I can also send you a normalization script in R if you want; it is very easy.
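The random-removal step described above can be sketched in Python rather than R (the function name and BED lines are made up for illustration; a fixed seed keeps the result reproducible):

```python
import random

def downsample_bed(lines, target_n, seed=0):
    """Randomly keep target_n lines from a BED file's lines,
    preserving their original order."""
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(lines)), target_n))
    return [lines[i] for i in keep]

# toy "bigger" library of 10 intervals
big = [f"chr1\t{i * 100}\t{i * 100 + 50}" for i in range(10)]
small_size = 6  # number of reads in the smaller library
subset = downsample_bed(big, small_size)
```

After this, both libraries contain the same number of intervals and can be compared directly, at the cost of discarding reads from the larger one.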

written 8.2 years ago by Sukhdeep Singh
shenli.sam wrote, 7.0 years ago:

Normalization by sequencing depth (i.e. total read count) is probably the simplest approach and is widely used. But it has some drawbacks. For example, if the signal-to-noise ratio is very different between two libraries, one library will contain more background reads than the other; those background reads are still counted in the total read count, which biases the scaling estimate.

diffReps takes a better approach to normalization. Basically, low-count regions are first removed from consideration, then a normalization ratio is calculated for each remaining region in each library. Finally, the median of these ratios is used as the normalization factor. This yields a relatively unbiased, robust estimate.
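A simplified sketch of that median-of-ratios idea in Python (this is not the actual diffReps implementation; the function name, threshold, and counts are made up for illustration):

```python
import statistics

def median_ratio_factor(counts_a, counts_b, min_count=10):
    """Simplified median-of-ratios normalization: drop low-count
    regions, then take the median of the per-region count ratios."""
    ratios = [b / a
              for a, b in zip(counts_a, counts_b)
              if a >= min_count and b >= min_count]
    return statistics.median(ratios)

# per-region read counts for two libraries; the (5, 3) region
# is discarded as low-count background before the median is taken
a = [50, 100, 5, 200, 80]
b = [100, 210, 3, 390, 160]
factor = median_ratio_factor(a, b)  # -> 2.0
```

Dividing the counts in library b by `factor` puts both libraries on a common scale while ignoring noisy low-count regions, and the median makes the estimate robust to the few regions with genuine differential binding.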

written 7.0 years ago by shenli.sam

If the percentage of duplicated reads is very different between the two samples, then the final number of usable reads (e.g. by MACS) will still be quite different.

written 6.0 years ago by Ian
dnaseiseq (United Kingdom) wrote, 6.5 years ago:

Bailey T, Krajewski P, Ladunga I, Lefebvre C, Li Q, et al. (2013) Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. PLoS Comput Biol 9(11): e1003326. doi:10.1371/journal.pcbi.1003326

written 6.5 years ago by dnaseiseq

This is a really nice overview; we will try to find a way to feature these more prominently.

written 6.5 years ago by Istvan Albert
Frenkiboy wrote, 8.2 years ago:

If the difference in the total number of reads between the two samples is not too big, sometimes people just randomly downsample the bigger one.

written 8.2 years ago by Frenkiboy
Powered by Biostar version 2.3.0