Question: Statistical Representation Of CpG Dinucleotides In Promoter Regions
Misty wrote, 6.8 years ago:

Hello all,

I am analyzing CpG dinucleotide patterns in the promoter binding consensus sequences of as many TFs as possible. I am using R, and my approach is to first retrieve known consensus sequences (for TFs such as HRE, CREB, etc.) and then determine whether CpG dinucleotides (NOT CpG islands) are over- or under-represented within them, by comparing their frequency in the sequence either with that of the entire genome or with that of a randomly generated sequence of similar length. I have been using the R packages "MotifDb" and "seqinr", but I couldn't find any package that does the statistical representation analysis. I have the following questions:

(1) Any suggestions for packages? Also, which statistical test would be the most accurate for the representation analysis (rho, z-score, or something else)?

(2) Are there any packages that let you do the same analysis for TFs lacking a known validated sequence? (My approach is to first construct random DNA strings of variable length, but I am not sure how to go about it.)

Thanks a lot!

Misty

Tags: R, promoter, statistics

Alex Reynolds (Seattle, WA USA) wrote, 6.8 years ago:

Perhaps you could apply a hypergeometric model.

You count the number of CpG dinucleotides in your background (K balls), which is one big "urn" containing all N dinucleotides (N balls); the remaining N-K balls are the non-CpG dinucleotides. For each promoter, you count the number of observed CpG dinucleotides (k balls) and the total number of dinucleotides across the promoter, including CpGs (n balls).
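As a rough illustration of the counting step, here is a minimal base-R sketch. The helper name count_dinucs and both toy sequences are made up for this example; it counts overlapping dinucleotides on one strand only, which is enough for CpG because the dinucleotide is its own reverse complement.

```r
# Count overlapping dinucleotides in a DNA string (base R only).
# count_dinucs is a hypothetical helper, not part of any package.
count_dinucs <- function(seq) {
  seq <- toupper(seq)
  len <- nchar(seq)
  if (len < 2) return(c(total = 0, cpg = 0))
  # All overlapping 2-mers: positions 1-2, 2-3, ..., (len-1)-len
  dinucs <- substring(seq, 1:(len - 1), 2:len)
  c(total = length(dinucs), cpg = sum(dinucs == "CG"))
}

# Toy sequences, for illustration only
background <- "ACGTTGCACGCGTATACGGATCCGTA"
promoter   <- "TCGCGACG"

bg <- count_dinucs(background)  # N = bg["total"], K = bg["cpg"]
pr <- count_dinucs(promoter)    # n = pr["total"], k = pr["cpg"]
```

For real data you would run the same count over the genome (or whatever background you choose) once, and over each promoter sequence separately.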

For each promoter, then, you can calculate a p-value: the probability of observing that level of CpG dinucleotide enrichment by chance, given the background (say, the whole genome, or some thoughtful subset of it).

To do this with R, see the phyper() function:

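> df
  promoter_name  N      K     n    k
  foo            27746  2825  775  320
  ...

> fn <- function(x) {
    # apply() passes each row as a character vector, so convert the counts back to numbers
    promoter_name <- x[1]
    N <- as.numeric(x[2])  # all dinucleotides in the background
    K <- as.numeric(x[3])  # CpG dinucleotides in the background
    n <- as.numeric(x[4])  # all dinucleotides in the promoter
    k <- as.numeric(x[5])  # CpG dinucleotides in the promoter
    # Upper tail P(X > k); pass k - 1 instead of k if you want P(X >= k)
    cat(promoter_name, "\t", phyper(k, K, N - K, n, lower.tail = FALSE), "\n")
}

> invisible(apply(df, 1, fn))
foo      8.115953e-119
...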

You could store and rank your promoters by p-value and focus analysis on top-ranked (lowest p-value) hits.
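A sketch of the storing and ranking step, under some assumptions: only the "foo" row comes from the example above, while the "bar" and "baz" rows are invented for illustration. It passes k - 1 rather than k so that the tail includes the observed count, which differs slightly from the phyper(k, ...) call shown earlier. The Benjamini-Hochberg adjustment is one common choice when testing many promoters at once, not the only one.

```r
# Hypothetical counts; "bar" and "baz" are made-up rows
df <- data.frame(
  promoter_name = c("foo", "bar", "baz"),
  N = c(27746, 27746, 27746),  # dinucleotides in the background
  K = c(2825, 2825, 2825),     # CpG dinucleotides in the background
  n = c(775, 800, 650),        # dinucleotides in each promoter
  k = c(320, 85, 60),          # CpG dinucleotides in each promoter
  stringsAsFactors = FALSE
)

# phyper() is vectorised, so whole columns can be passed directly.
# k - 1 with lower.tail = FALSE gives P(X >= k).
df$p_value <- phyper(df$k - 1, df$K, df$N - df$K, df$n, lower.tail = FALSE)

# Adjust for multiple testing across promoters (Benjamini-Hochberg)
df$p_adj <- p.adjust(df$p_value, method = "BH")

# Lowest p-values first
ranked <- df[order(df$p_value), ]
```

Working on columns avoids apply(), which turns each row of a mixed data frame into character strings.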


Thanks Alex.. will try it out!

Reply written 6.8 years ago by Misty