Question

Chip-Seq, Get Highly Enriched Regions For A Certain Mark Within One Sample

6

Entering edit mode

10.2 years ago

Sirus ▴ 820

Hi guys, When reading through many papers, I see many people people mentioning features like H3K4me1+ (or hight), H3K4me3-,.... etc, generally they don't say how they did it automatically to find these regions.

To make the picture clear, my situation is as follow:

I have some Chip-Seq data (One sample per mark for a certain cell line)
let's say for the H3K4me1 sample I already did the peak calling and I have the list of peaks and the signal intensity in each peak
I want to find which peaks are highly enriched (H3K4me1+) and which peaks a lowly enriched (H3K4me1-)
A threshold need to be defined to do that

I found this paper (Combinatorial patterns of histone acetylations and methylations in the human genome) that described a method and I don't know if I got their meaning right. I hope through this post to get you kind clarifications :) The authors say:

The modiﬁcation on a promoter under consideration was deemed signiﬁcant when the tag count was higher than a threshold, which was determined by a P value taken from a background model of Poisson distribution parameterized by the genome-wide tag density

does it mean to just do :

lambda = mean(tags)

Then if region X has k tag, I will just do:

pval = Poisson( X>= k; lambda)

and consider the one with significant p-values as enriched.

Or it is not the case here?

Edit: An additional question in the context of this question. For the tag enrichment plot, is it preferable to calculate it using the final BED file or using a BigWig file? when using histone marks called reads, you get large regions with one value example

  seqnames             ranges strand |     score
         <Rle>          <IRanges>  <Rle> | <numeric>
  [1]    chr18 [8118742, 8128768]      * |       7.1

so when I do the mean of the tag count it sounds more over estimated.

Thanks in advance.

chip-seq enrichment • 4.7k views

ADD COMMENT • link 10.1 years ago by Sirus ▴ 820

0

Entering edit mode

Shall we consider the length of the region? I guess the P-value should be calculated as:

Pval = Possion (X>=k; lambda*region_length)

ADD REPLY • link 8.3 years ago by chjiao3456 ▴ 40

0

Entering edit mode

Yeah,

A more elaborated model is in the jmosaics model

http://www.genomebiology.com/2013/14/4/R38

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by Sirus ▴ 820

score 3 · Answer 1 · 2014-03-19

3

Entering edit mode

10.2 years ago

Istvan Albert 100k

There are many different ways to compute significance thresholds, and as almost always the methods all agree where the effects are pronounced and disagree when the effects are weak. In these latter cases the errors rates are also much higher.

Here is a more recent review/summary:

Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data, PLoS Comp. Biology 2013

ADD COMMENT • link 10.2 years ago by Istvan Albert 100k

0

Entering edit mode

Hi Albert, Thanks for the paper :). However, they talk about differential binding and comparing peaks between different samples. In my case I have just one sample in one condition and I want to find which peaks within my sample have significant hight enrichment to classify them into highly and lowly enriched regions

ADD REPLY • link 10.1 years ago by Sirus ▴ 820

1

Entering edit mode

Well you have to have some type of background estimate to detect minimally significant enrichment. But after that, among those that you do find to pass the threshold the division into high and low enrichment is a bit arbitrary.

An easier approach would be to do it the other way around, identify peaks by a different attribute and see if their enrichment is different than that of another other group. If you first select by enrichment then the dimensionality of the problem is much larger as there are more causes for the same effect.

ADD REPLY • link 10.1 years ago by Istvan Albert 100k

0

Entering edit mode

In my case I want the threshold to be dynamic for each mark, in this way I can annotate which parts are at the same time H3K4me1+, H3K4me3-,... etc So I thought I can for each mark do the classification, and I can select any combination and just do the intersection. for the moment I am using the Poisson one :), with p-value correction

ADD REPLY • link 10.1 years ago by Sirus ▴ 820