Question

Chip-Seq Enrichment Profile Significance?

2


10.6 years ago

daniel.soronellas ▴ 330

Dear community,

I have a question about enrichment profiles in ChIP-seq data.

Is it possible to calculate the significance of the enrichment difference between two ChIP-seq profiles at specific genomic locations (e.g. the TSS)? How could this be done?

Below is an example image I found by googling. Suppose that, for the profiles in the image, I want to know whether the enrichment shown by the red line (0.26-0.27) is significant compared with the blue line (0.09-0.10).


Reference of the image: RYBP and Cbx7 Define Specific Biological Functions of Polycomb Complexes in Mouse Embryonic Stem Cells (Cell Reports, Feb 2013), Fig. 2-E http://download.cell.com/cell-reports/pdf/PIIS2211124712004238.pdf?intermediate=true

Thanks for your help!

chip-seq • 4.8k views

ADD COMMENT • link updated 8.8 years ago by AlexAbdulkaderKheirallah ▴ 120 • written 10.6 years ago by daniel.soronellas ▴ 330

0


Use this tool:

https://github.com/shenlab-sinai/ngsplot

ADD REPLY • link 8.8 years ago by AlexAbdulkaderKheirallah ▴ 120

score 3 · Answer 1 · 2013-09-02

There are several ways to do this, depending on how many parametric assumptions you want to make about the data.

Personally, I think the most believable approach is a permutation-based strategy focusing on how the two average profiles are generated (the red and blue lines in your plot). I would generate many control datasets where the assignments between binding sites and genomic labels (e.g. near TSS and not near TSS) are shuffled. I would then look at the distribution of test statistics across the random "control" experiments, and see where the actual (unshuffled) experiment falls on the list. This will give you an empirical p-value that, at least to me, would be believable. It would definitely be better than chi-squared tests, as proposed in another answer.
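The shuffling procedure described above could be sketched roughly as follows. This is only an illustration under assumptions that are not in the original answer: each site is represented by an equal-length signal vector, the labels are booleans, and the placeholder statistic `max_abs_diff` (largest gap between the two average profiles) stands in for whatever test statistic you prefer. All function names are hypothetical.

```python
import random

def mean_profile(profiles):
    """Position-wise average of equal-length per-site signal vectors."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def max_abs_diff(p, q):
    """Placeholder statistic: largest gap between two average profiles."""
    return max(abs(a - b) for a, b in zip(p, q))

def permutation_pvalue(profiles, labels, statistic=max_abs_diff,
                       n_perm=1000, seed=0):
    """Empirical p-value for the difference between two average profiles.

    profiles: one signal vector per site (assumed input format)
    labels:   True/False per site, e.g. near TSS vs. not near TSS;
              both groups must be non-empty
    """
    rng = random.Random(seed)

    def stat_for(lab):
        group_a = [p for p, flag in zip(profiles, lab) if flag]
        group_b = [p for p, flag in zip(profiles, lab) if not flag]
        return statistic(mean_profile(group_a), mean_profile(group_b))

    observed = stat_for(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the site-to-label assignment
        if stat_for(shuffled) >= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return observed, (hits + 1) / (n_perm + 1)
```

Shuffling keeps the group sizes fixed, so only the assignment of sites to labels changes between the control datasets. The smallest p-value this can report is 1 / (n_perm + 1), so increase `n_perm` if you need finer resolution.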

For a test statistic comparing the two distributions, you have several choices. See the Wikipedia page on statistical distances for a nice list. KL divergence or earth mover's distance would be my first choices, but only out of habit, not for any principled reason.
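As a rough sketch of those two candidates, assuming each average profile is a list of non-negative values on equally spaced bins: the functions below first normalize each profile to sum to 1, so they compare the shape of the profiles rather than their overall height. The function names and the small pseudocount (added to avoid taking the log of zero) are my own choices for illustration.

```python
import math

def normalize(profile, pseudocount=1e-9):
    """Turn a non-negative signal profile into a probability distribution."""
    shifted = [x + pseudocount for x in profile]
    total = sum(shifted)
    return [x / total for x in shifted]

def kl_divergence(p, q):
    """KL(p || q) in nats. Asymmetric: kl(p, q) != kl(q, p) in general."""
    p, q = normalize(p), normalize(q)
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def earth_movers_distance(p, q):
    """1-D earth mover's distance on equally spaced bins (unit: one bin)."""
    p, q = normalize(p), normalize(q)
    running, total = 0.0, 0.0
    for a, b in zip(p, q):
        running += a - b       # mass that still has to move rightwards
        total += abs(running)  # cost of carrying it across one bin
    return total
```

Either function can serve as the test statistic inside a permutation scheme. If the difference in height between the two lines is what matters to you, a statistic on the unnormalized profiles may be more appropriate.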

score 2 · Answer 2 · 2013-09-02

You could try averaging the signal over a window, e.g. 0 to 500 bp after the TSS, for both the red and blue lines (peaks). Then you could average the signal over a window where these proteins are not enriched, e.g. -2500 to -2000 bp before the transcription start site, for both the blue and red lines (troughs). You will now have four numbers: the peak and trough averages for the red line and the peak and trough averages for the blue line. Now perform a chi-square test where the expected range is the blue line numbers and the actual range is the red line numbers. This will give you a p-value.

In Excel it would look like this: =CHITEST(red peak avg.:red trough avg., blue peak avg.:blue trough avg.)

or, using estimates read off the graph (each pair of values stands in for a cell range):

=CHITEST(0.28:0.055,0.11:0.033)

Another way would be to use the raw signal values and run the chi-square test on the full red and blue lines, something like this: =CHITEST(red signal -2500:red signal 2500, blue signal -2500:blue signal 2500). This method uses all the data points, not just averages.
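For anyone working outside Excel, the four peak and trough averages could be computed along these lines. This is a minimal sketch: the input format (a list of positions in bp relative to the TSS plus one signal list per line) and the function names are assumptions for illustration, and the default windows are the ones suggested above.

```python
def window_mean(positions, signal, start, end):
    """Mean signal over positions with start <= pos <= end (bp from TSS)."""
    values = [s for pos, s in zip(positions, signal) if start <= pos <= end]
    if not values:
        raise ValueError("no data points fall inside the window")
    return sum(values) / len(values)

def peak_trough_summary(positions, red, blue,
                        peak=(0, 500), trough=(-2500, -2000)):
    """Return the four averages: peak and trough for the red and blue lines."""
    return {
        "red_peak": window_mean(positions, red, *peak),
        "red_trough": window_mean(positions, red, *trough),
        "blue_peak": window_mean(positions, blue, *peak),
        "blue_trough": window_mean(positions, blue, *trough),
    }
```

The resulting four numbers are the ones that would go into the CHITEST call.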

Other ways/caveats:

-Instead of using just the red and blue line averages, include the Ig (mock IP) control by using ratios, e.g. (blue line average peak) / (black line average peak). This will still give you four numbers at the end, but they will all be ratios.

-You may want to consult a stats person too, because I'm not sure if a chi-square test is the best for answering your question, but it will give you a p-value.

-I'm not sure what p-value would be considered significant.

I'm interested in hearing what others have to say about this because I too have wondered if there was a proper way to do such an analysis.