Question: Chip-Seq, Get Highly Enriched Regions For A Certain Mark Within One Sample
5
gravatar for Sirus
5.2 years ago by
Sirus770
Boston/USA
Sirus770 wrote:

Hi guys, When reading through many papers, I see many people people mentioning features like H3K4me1+ (or hight), H3K4me3-,.... etc, generally they don't say how they did it automatically to find these regions.

To make the picture clear, my situation is as follow:

  • I have some Chip-Seq data (One sample per mark for a certain cell line)
  • let's say for the H3K4me1 sample I already did the peak calling and I have the list of peaks and the signal intensity in each peak
  • I want to find which peaks are highly enriched (H3K4me1+) and which peaks a lowly enriched (H3K4me1-)
  • A threshold need to be defined to do that

I found this paper (Combinatorial patterns of histone acetylations and methylations in the human genome) that described a method and I don't know if I got their meaning right. I hope through this post to get you kind clarifications :) The authors say:

The modification on a promoter under consideration was deemed significant when the tag count was higher than a threshold, which was determined by a P value taken from a background model of Poisson distribution parameterized by the genome-wide tag density

does it mean to just do :

lambda = mean(tags)

Then if region X has k tag, I will just do:

pval = Poisson( X>= k; lambda)

and consider the one with significant p-values as enriched.

Or it is not the case here?

Edit: An additional question in the context of this question. For the tag enrichment plot, is it preferable to calculate it using the final BED file or using a BigWig file? when using histone marks called reads, you get large regions with one value example

  seqnames             ranges strand |     score
         <Rle>          <IRanges>  <Rle> | <numeric>
  [1]    chr18 [8118742, 8128768]      * |       7.1

so when I do the mean of the tag count it sounds more over estimated.

Thanks in advance.

enrichment chip-seq • 3.2k views
ADD COMMENTlink modified 5.2 years ago • written 5.2 years ago by Sirus770

Shall we consider the length of the region? I guess the P-value should be calculated as:

Pval = Possion (X>=k; lambda*region_length)

ADD REPLYlink written 3.3 years ago by chjiao345640

Yeah, 

A more elaborated model is in the jmosaics model 

http://www.genomebiology.com/2013/14/4/R38

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Sirus770
1
gravatar for Istvan Albert
5.2 years ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

There are many different ways to compute significance thresholds, and as almost always the methods all agree where the effects are pronounced and disagree when the effects are weak. In these latter cases the errors rates are also much higher.

Here is a more recent review/summary:

Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data, PLoS Comp. Biology 2013

ADD COMMENTlink modified 5.2 years ago • written 5.2 years ago by Istvan Albert ♦♦ 80k

Hi Albert, Thanks for the paper :). However, they talk about differential binding and comparing peaks between different samples. In my case I have just one sample in one condition and I want to find which peaks within my sample have significant hight enrichment to classify them into highly and lowly enriched regions

ADD REPLYlink written 5.2 years ago by Sirus770
1

Well you have to have some type of background estimate to detect minimally significant enrichment. But after that, among those that you do find to pass the threshold the division into high and low enrichment is a bit arbitrary.

An easier approach would be to do it the other way around, identify peaks by a different attribute and see if their enrichment is different than that of another other group. If you first select by enrichment then the dimensionality of the problem is much larger as there are more causes for the same effect.

ADD REPLYlink written 5.2 years ago by Istvan Albert ♦♦ 80k

In my case I want the threshold to be dynamic for each mark, in this way I can annotate which parts are at the same time H3K4me1+, H3K4me3-,... etc So I thought I can for each mark do the classification, and I can select any combination and just do the intersection. for the moment I am using the Poisson one :), with p-value correction

ADD REPLYlink written 5.2 years ago by Sirus770
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 844 users visited in the last hour