Question: ChIP-Seq Normalization
7.6 years ago by
Repineme110 wrote:


I have sequenced human ChIP-Seq samples from 2 different experiments using Illumina. The numbers of reads are not equivalent between the 2 samples (Heart ChIP-Seq = 2 million tags and Kidney ChIP-Seq = 10 million tags), and I have no replicates.

Whenever I try to plot raw reads around promoters, the plot fails (one flat line on top and another on the bottom) because of the difference in the number of reads. Does anyone know the BEST way to deal with this?

I tried this (not successful):

position_cDNAnorm = (position_cDNA / sum_cDNA) * average_sum_cDNA

  • position_cDNAnorm = normalised cDNA value for specific position and specific DBP
  • position_cDNA = cDNA value for specific position and specific DBP
  • sum_cDNA = total cDNA count for specific DBP
  • average_sum_cDNA = average of total cDNA counts of all DBPs

DBP = DNA Binding Protein (transcription factor)
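As a worked illustration of the formula above, here is a minimal Python sketch. The function name and the per-position counts (10 and 50) are made up for the example; only the library sizes (2 million and 10 million) come from the question.

```python
def scale_to_average_depth(position_count, sample_total, all_totals):
    """Apply: position_norm = (position / sum) * average_sum."""
    average_total = sum(all_totals) / len(all_totals)
    return position_count / sample_total * average_total


# Library sizes from the question; the per-position counts are hypothetical.
totals = {"heart": 2_000_000, "kidney": 10_000_000}
all_totals = list(totals.values())

heart_norm = scale_to_average_depth(10, totals["heart"], all_totals)
kidney_norm = scale_to_average_depth(50, totals["kidney"], all_totals)

print(heart_norm, kidney_norm)
```

Both values come out at about 30, because each raw count is the same fraction of its library. Note that this is a linear rescaling by library size, so it differs from a reads-per-million scaling only by a constant factor.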

What if you get the signal (bedGraph), let's say from MACS when you run it with option -B, then calculate the average tag count in the same window for both samples and divide it by the total number of mapped reads in millions?

written 4.5 years ago by Manvendra Singh2.0k
7.6 years ago by
United States
seidel6.8k wrote:

Instead of plotting raw reads, plot the rate at which reads are observed in a given location. It sounds odd expressed that way, but basically what you want to observe is reads per million per nucleotide (RPM). However, since nucleotide resolution is pretty extreme, people usually pick a larger bin, say 25 nucleotides, and then you calculate the number of reads that fall into that bin divided by the number of reads in the sample data set, then multiply by 10^6 to get per million. In this way you get an RPM track of 25 base bins covering the genome, thus samples with different numbers of reads become comparable. If your data is in the form of a vector representing coverage, this is especially easy to do in R.
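The answer mentions R; purely as an illustration, the same binning can be sketched in plain Python. This assumes the reads of one chromosome are available as a list of 0-based start positions and that `total_reads` is the number of mapped reads in the whole sample. The function name and the toy positions are hypothetical.

```python
from collections import Counter


def rpm_bins(read_positions, total_reads, bin_size=25):
    """Return {bin_start: reads per million} for one chromosome."""
    # Assign each read to the start coordinate of the bin it falls into.
    counts = Counter((pos // bin_size) * bin_size for pos in read_positions)
    # Reads in bin / reads in sample * 10^6.
    scale = 1_000_000 / total_reads
    return {start: n * scale for start, n in sorted(counts.items())}


# Toy data: 3 reads in bin 100 for the shallow sample, 15 for the deep one.
heart = rpm_bins([100, 110, 120, 130], total_reads=2_000_000)
kidney = rpm_bins([100 + i for i in range(15)], total_reads=10_000_000)

print(heart)
print(kidney)
```

In this toy case the raw counts in bin 100 differ fivefold (3 vs 15), but both come out at roughly 1.5 RPM, which is what makes the two tracks comparable.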

There's a good description of both of your issues, identification of enriched regions at promoters and quantile normalization of reads, in the supplemental material of the following two papers from the Young Lab: Rahl et al. (2010) Cell and Bilodeau et al. (2010) Genes Dev.


I'm not sure what you mean by "didn't work". Either a region has coverage, or it doesn't. If it doesn't have coverage - there is no way to get coverage besides doing more sequencing or repeating the experiment. If it does have coverage, then you should be able to visualize it by simply loading the indexed BAM file to UCSC. If you want to normalize that coverage, then you can convert it to something like reads per million for a given bin size, but even then, whether the regions show similar patterns or depth is a matter of experimental determination (as opposed to assumption).

written 7.6 years ago by seidel6.8k

Seems logical. Once I have the 25 bp bins (chr, start, end, strand) with the RPM coverage of data 1 and data 2 (3 columns), is there any way to plot them around my own genomic regions (TSS or exon-intron junctions)?

written 7.6 years ago by Repineme110
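One possible way to approach this, sketched in plain Python: average the binned RPM values at fixed offsets around each region of interest, flipping the direction for minus-strand features. This is only an illustration under stated assumptions. The function name, the dict-based input (bin start mapped to RPM for a single chromosome), and the toy coordinates are all hypothetical.

```python
def tss_profile(rpm_by_bin, tss_list, flank=100, bin_size=25):
    """Average RPM at each bin offset around a set of TSSs.

    rpm_by_bin: {bin_start: RPM} for one chromosome (missing bins count as 0)
    tss_list:   list of (position, strand) tuples, strand is "+" or "-"
    """
    if not tss_list:
        raise ValueError("tss_list must contain at least one TSS")
    offsets = range(-flank, flank + bin_size, bin_size)
    sums = {off: 0.0 for off in offsets}
    for pos, strand in tss_list:
        tss_bin = (pos // bin_size) * bin_size
        # On the minus strand, "downstream" means decreasing coordinates.
        sign = 1 if strand == "+" else -1
        for off in offsets:
            sums[off] += rpm_by_bin.get(tss_bin + sign * off, 0.0)
    return {off: total / len(tss_list) for off, total in sums.items()}


# Toy example: two bins with signal, one TSS on each strand.
rpm = {1000: 2.0, 1025: 4.0}
profile = tss_profile(rpm, [(1010, "+"), (1030, "-")], flank=25)

print(profile)
```

Running this once per sample gives one curve each; the offsets and averaged values can then be handed to any plotting tool to draw both samples on the same axes.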

This didn't work either. It produced the same results as my own normalization.

written 7.6 years ago by Repineme110