Statistical Representation of CpG Dinucleotides in Promoter Regions
10.9 years ago
Misty ▴ 10

Hello all,

I am analyzing CpG dinucleotide patterns in promoter-binding consensus sequences for as many TFs as possible, using R. My approach is to first retrieve known consensus sequences (for TFs such as HRE, CREB, etc.) and then determine whether CpG dinucleotides (NOT CpG islands) within them are over- or under-represented, by comparing their frequency in the sequence either with that of the entire genome or with that of a randomly generated sequence of similar length. I have been using the R packages "MotifDb" and "seqinr", but I couldn't find any package that does the statistical representation analysis. I have the following questions:

(1) Any suggestions for packages? Also, which statistic would be most appropriate for the representation analysis (rho, z-score, or something else)?

(2) Are there any packages that support the same analysis for TFs lacking a known, validated sequence? (My approach would be to first construct random DNA strings of variable length, but I am not sure how to go about it.)

Thanks a lot!

Misty

10.9 years ago

Perhaps you could apply a hypergeometric model.

You count the number of CpG dinucleotides (K balls) in your background (one big "urn" containing N balls, i.e. all dinucleotides), which leaves N-K dinucleotides of any other, non-CpG kind. For each promoter, you count the number of observed CpG dinucleotides (k balls) and the total number of dinucleotides across the promoter, including CpGs (n balls).
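As a rough illustration of the counting step, here is a minimal base-R sketch. The helper name count_cpg() and both toy sequences are made up for this example; real promoter and background sequences would come from your own data.

```r
# Count overlapping dinucleotides in one DNA string.
# Returns n (all dinucleotides) and k (those equal to "CG").
count_cpg <- function(seq) {
  seq <- toupper(seq)
  len <- nchar(seq)
  if (len < 2) return(c(n = 0, k = 0))
  # Slide a window of width 2 along the sequence
  dinucs <- substring(seq, 1:(len - 1), 2:len)
  c(n = length(dinucs), k = sum(dinucs == "CG"))
}

# Toy sequences, made up for illustration only
promoter   <- "ACGCGT"
background <- "ATATCGATTTGCAT"

promoter_counts   <- count_cpg(promoter)    # n and k for one promoter
background_counts <- count_cpg(background)  # plays the role of N and K for the urn
```

Applied to a background sequence, the same helper gives N and K; applied to each promoter, it gives n and k.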

For each promoter, you can then calculate a p-value for CpG dinucleotide enrichment within that promoter relative to the background (say, the whole genome, or some thoughtful subset of it), i.e. how unlikely it is to draw so many CpGs in n draws from the urn by chance.

To do this with R, see the phyper() function:

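> df
  promoter_name  N      K     n    k
  foo            27746  2825  775  320
  ...

> fn <- function(x) {
    # apply() hands each row over as a character vector,
    # so the counts must be converted back to numbers
    promoter_name <- x[1]
    N <- as.numeric(x[2])  # all dinucleotides in the background
    K <- as.numeric(x[3])  # CpG dinucleotides in the background
    n <- as.numeric(x[4])  # all dinucleotides in the promoter
    k <- as.numeric(x[5])  # CpG dinucleotides in the promoter
    # Upper tail P(X > k); pass k - 1 instead to get P(X >= k)
    cat(promoter_name, "\t", phyper(k, K, N - K, n, lower.tail = FALSE), "\n")
}

> invisible(apply(df, 1, fn))
foo      8.115953e-119
...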

You could store and rank your promoters by p-value and focus analysis on top-ranked (lowest p-value) hits.
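One possible way to do the storing and ranking is sketched below. Only the "foo" row comes from the table above; "bar" and "baz" are invented counts for illustration. The Benjamini-Hochberg adjustment via p.adjust() is an extra step I am assuming you might want when testing many promoters, not something the ranking itself requires.

```r
# Hypothetical counts: only "foo" is from the table above, the rest are made up
df <- data.frame(
  promoter_name = c("foo", "bar", "baz"),
  N = c(27746, 27746, 27746),
  K = c(2825, 2825, 2825),
  n = c(775, 600, 900),
  k = c(320, 65, 130),
  stringsAsFactors = FALSE
)

# phyper() is vectorised, so whole columns can be passed at once.
# Upper tail P(X > k), as in the per-row function above.
df$p_value <- phyper(df$k, df$K, df$N - df$K, df$n, lower.tail = FALSE)

# Benjamini-Hochberg adjustment (an added assumption for multiple testing)
df$p_adj <- p.adjust(df$p_value, method = "BH")

# Lowest p-values first
ranked <- df[order(df$p_value), ]
print(ranked[, c("promoter_name", "k", "n", "p_value", "p_adj")])
```

Working on columns avoids the character coercion that apply() performs on a data frame with mixed column types.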

Thanks, Alex. I will try it out!
