Question

Statistical Representation Of CpG Dinucleotides In Promoter Regions

12.1 years ago

Misty ▴ 10

Hello all,

I am analyzing CpG dinucleotide patterns in promoter binding consensus sequences for as many TFs as possible. I am using R, and my approach is to first retrieve known consensus sequences (of TFs such as HRE, CREB, etc.) and then determine whether CpG dinucleotides (NOT CpG islands) within them are over- or under-represented, by comparing their frequency in the sequence either with that of the entire genome or with that of a randomly generated sequence of similar length. I have been using the R packages "MotifDb" and "seqinr", but I couldn't find any package that does the statistical representation analysis. I have the following questions:

(1) Any suggestions for packages? Also, which statistical test would be the most appropriate for the representation analysis (rho, z-score, or something else)?

(2) Are there any packages that let you do the same analysis for TFs lacking a known validated sequence? (My approach would be to first construct random DNA strings of variable length, but I am not sure how to go about it.)

Thanks a lot!

Misty

statistics promoter r • 3.5k views

updated 12.1 years ago by Alex Reynolds • written 12.1 years ago by Misty

Ram · Answer 1 · 2013-06-04

Perhaps you could apply a hypergeometric model.

You count the number of CpG dinucleotides in your background (K balls, in one big "urn" containing N balls, i.e. all dinucleotides); the remaining N-K balls are the dinucleotides of any other, non-CpG constituency. For each promoter, you count the number of observed CpG dinucleotides (k balls) and the total number of all dinucleotides across the promoter, including CpGs (n balls).

For each promoter, then, you can calculate a p-value: the probability of seeing CpG dinucleotide enrichment this strong within a promoter by chance, given the background (say, the whole genome, or some thoughtful subset of it).
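As a rough illustration of the counting step, here is a minimal base-R sketch. The helper name `count_cpg` and the two toy sequences are made up for this example, and it counts overlapping dinucleotides on one strand only. Overlapping dinucleotides are not independent draws, so the urn model is an approximation at best.

```r
# Count overlapping dinucleotides in a DNA string (base R only).
# Returns the number of CpG ("CG") dinucleotides and the total dinucleotide count.
# count_cpg is a hypothetical helper name, not a function from any package.
count_cpg <- function(dna) {
  dna <- toupper(dna)
  len <- nchar(dna)
  if (len < 2) return(c(cpg = 0, total = 0))
  starts <- seq_len(len - 1)
  dinucs <- substring(dna, starts, starts + 1)
  c(cpg = sum(dinucs == "CG"), total = length(dinucs))
}

# Toy sequences, for illustration only (not real promoter data)
background <- "ATGCGCGTATACGATTTACGGCATAT"
promoter   <- "CGCGACGT"

bg <- count_cpg(background)
pr <- count_cpg(promoter)

# Map the counts onto the urn notation used in this answer
N <- bg[["total"]]  # all dinucleotides in the background
K <- bg[["cpg"]]    # CpG dinucleotides in the background
n <- pr[["total"]]  # all dinucleotides in the promoter
k <- pr[["cpg"]]    # CpG dinucleotides in the promoter
```

For real data you would read the sequences from FASTA files rather than typing them in, and probably want to decide up front whether to count both strands.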

To do this with R, see the phyper() function:

> df 
  promoter_name  N      K     n    k
  foo            27746  2825  775  320
  ...

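> fn <- function(x) {
    # apply() hands each row over as a character vector,
    # so the counts have to be converted back to numbers
    promoter_name <- x[1]
    N <- as.numeric(x[2])
    K <- as.numeric(x[3])
    n <- as.numeric(x[4])
    k <- as.numeric(x[5])
    # with lower.tail=FALSE this is P(X > k); pass k - 1 instead if you want P(X >= k)
    cat(promoter_name, "\t", phyper(k, K, (N-K), n, lower.tail=FALSE), "\n")
}

> invisible(apply(df, 1, fn))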
foo      8.115953e-119
...

You could store and rank your promoters by p-value and focus analysis on top-ranked (lowest p-value) hits.
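A minimal sketch of the store-and-rank step, assuming a data frame laid out like the one above. Only the "foo" row comes from the example; "bar" and "baz" are invented counts for illustration. Note that this sketch passes k - 1 to phyper() so that the observed count is included in the upper tail, which differs slightly from phyper(k, ...).

```r
# Hypothetical counts for three promoters; only "foo" is taken from the example above
df <- data.frame(
  promoter_name = c("foo", "bar", "baz"),
  N = c(27746, 27746, 27746),  # all dinucleotides in the background
  K = c(2825, 2825, 2825),     # CpG dinucleotides in the background
  n = c(775, 600, 900),        # all dinucleotides in each promoter
  k = c(320, 61, 150),         # CpG dinucleotides in each promoter
  stringsAsFactors = FALSE
)

# phyper() is vectorized, so no explicit loop over rows is needed.
# k - 1 with lower.tail = FALSE gives P(X >= k).
df$p_value <- with(df, phyper(k - 1, K, N - K, n, lower.tail = FALSE))

# Sort so that the lowest p-values (strongest enrichment) come first
ranked <- df[order(df$p_value), ]
print(ranked[, c("promoter_name", "k", "n", "p_value")])
```

Keeping the p-values in a column rather than printing them with cat() makes it easy to sort, filter, or export the results afterwards.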