Question

Chip-Seq Normalization

2

12.7 years ago

Repineme ▴ 120

Hi,

I have sequenced human ChIP-Seq samples from 2 different experiments using Illumina. The number of reads differs between the 2 samples (heart ChIP-Seq = 2 million tags, kidney ChIP-Seq = 10 million), and I have no replicates.

Whenever I try to plot raw reads around promoters, the plot fails (one flat line on top and another on the bottom) because of the difference in the number of reads. Does anyone know the best way to deal with this?

I tried this (not successful):

position_cDNAnorm = (position_cDNA / sum_cDNA) * average_sum_cDNA

position_cDNAnorm = normalised cDNA value for specific position and specific DBP

position_cDNA = cDNA value for specific position and specific DBP

sum_cDNA = total cDNA count for specific DBP

average_sum_cDNA = average of total cDNA counts of all DBPs

DBP = DNA Binding Protein (transcription factor)
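To make the arithmetic of the formula above concrete, here is a minimal Python sketch. The function name and the example counts at a single position (10 and 50) are made up for illustration; only the two library sizes (2 million and 10 million) come from the question.

```python
def scale_to_average_depth(position_count, sample_total, all_totals):
    """Apply (position_cDNA / sum_cDNA) * average_sum_cDNA to one position.

    position_count: count at one position in one sample
    sample_total:   total count for that sample (sum_cDNA)
    all_totals:     total counts of every sample being compared
    """
    if sample_total <= 0:
        raise ValueError("sample_total must be positive")
    average_total = sum(all_totals) / len(all_totals)
    return (position_count / sample_total) * average_total


# Library sizes from the question: heart = 2 million, kidney = 10 million.
totals = [2_000_000, 10_000_000]

# Hypothetical counts at the same promoter position in each sample.
heart_scaled = scale_to_average_depth(10, totals[0], totals)
kidney_scaled = scale_to_average_depth(50, totals[1], totals)
print(heart_scaled, kidney_scaled)
```

With these made-up counts both samples land at roughly the same scaled value, because 10 reads out of 2 million and 50 reads out of 10 million are the same fraction of their libraries.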

data chip-seq • 11k views

ADD COMMENT • link updated 9.6 years ago by Biostar 20 • written 12.7 years ago by Repineme ▴ 120

0

What if you get the signal (bedGraph), say from MACS when you run it with the -B option,

then calculate the average tag count in the same window for both samples and divide it by the total number of mapped reads, in millions?
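A rough Python sketch of that suggestion follows. It assumes a plain four-column bedGraph (chrom, start, end, value, with 0-based half-open intervals) and treats bases not listed in the file as zero signal. The function names and the toy lines are hypothetical, and the mapped-read total has to come from your own alignment statistics.

```python
def mean_signal_in_window(bedgraph_lines, chrom, win_start, win_end):
    """Length-weighted mean bedGraph value over [win_start, win_end)."""
    if win_end <= win_start:
        raise ValueError("window must have positive length")
    total = 0.0
    for line in bedgraph_lines:
        # Skip blank lines and track/browser/comment header lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        c, start, end, value = line.split()[:4]
        if c != chrom:
            continue
        overlap = min(int(end), win_end) - max(int(start), win_start)
        if overlap > 0:
            total += overlap * float(value)
    return total / (win_end - win_start)


def per_million(mean_signal, mapped_reads):
    """Divide a window's mean signal by the mapped read count in millions."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return mean_signal / (mapped_reads / 1_000_000)


# Toy bedGraph content; a real file would be read with open(path).
lines = [
    "chr1\t0\t50\t4",
    "chr1\t50\t100\t8",
    "chr2\t0\t100\t100",
]
window_mean = mean_signal_in_window(lines, "chr1", 25, 75)
print(per_million(window_mean, 2_000_000))
```

Running the same window through both samples, each with its own mapped-read total, gives values that can be plotted on one axis.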

ADD REPLY • link 9.6 years ago by Manvendra Singh ★ 2.2k

score 7 · Answer 1 · 2011-08-07

7

12.7 years ago

seidel 11k

Instead of plotting raw reads, plot the rate at which reads are observed at a given location. It sounds odd expressed that way, but what you want to observe is reads per million (RPM) per nucleotide. Since single-nucleotide resolution is pretty extreme, people usually pick a larger bin, say 25 nucleotides: count the reads that fall into each bin, divide by the total number of reads in the sample, and multiply by 10^6 to get reads per million. This gives you an RPM track of 25-base bins covering the genome, so samples with different numbers of reads become comparable. If your data is in the form of a vector representing coverage, this is especially easy to do in R.

There's a good description of both of your issues, identification of enriched regions at promoters and quantile normalization of reads, in the supplemental material of the following two papers from the Young Lab: Rahl et al. (2010) Cell and Bilodeau et al. (2010) Genes Dev.
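The binning described above can be sketched in a few lines of Python (the answer mentions R; Python is used here only for illustration). This assumes each read is represented by a single 0-based start position on one chromosome; the function name and the toy positions are hypothetical.

```python
from collections import Counter


def rpm_per_bin(read_starts, total_reads, bin_size=25):
    """Return {bin_index: reads per million} for one chromosome.

    read_starts: 0-based start position of each read on this chromosome
    total_reads: total number of reads in the whole sample
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    # Integer division assigns each read to its fixed-width bin.
    counts = Counter(pos // bin_size for pos in read_starts)
    scale = 1_000_000 / total_reads
    return {bin_index: n * scale for bin_index, n in counts.items()}


# Five toy reads from a sample with 2 million reads in total.
track = rpm_per_bin([0, 10, 24, 25, 60], total_reads=2_000_000)
print(track)
```

Bin `i` covers positions `i * bin_size` up to, but not including, `(i + 1) * bin_size`. Bins with no reads are simply absent from the result and can be treated as zero when plotting.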

ADD COMMENT • link 12.7 years ago by seidel 11k

1

I'm not sure what you mean by "didn't work". Either a region has coverage or it doesn't. If it doesn't have coverage, there is no way to get coverage besides doing more sequencing or repeating the experiment. If it does have coverage, then you should be able to visualize it by simply loading the indexed BAM file into the UCSC Genome Browser. If you want to normalize that coverage, you can convert it to something like reads per million for a given bin size, but even then, whether the regions show similar patterns or depth is a matter of experimental determination (as opposed to assumption).

ADD REPLY • link 12.7 years ago by seidel 11k

0

Seems logical. Once I get the 25 bp bins (chr, start, end, strand) with data1 RPM coverage and data2 RPM coverage (3 columns), is there any way to plot them around my own genomic regions (TSS or exon-intron junctions)?