I am analyzing CpG dinucleotide patterns in the consensus binding sequences of as many transcription factors (TFs) as possible, using R. My approach is to first retrieve known consensus sequences (e.g. HRE, CREB, etc.) and then determine whether CpG dinucleotides (NOT CpG islands) are over- or under-represented within them, by comparing their frequency in the consensus sequence either with the frequency in the entire genome or with the frequency in a randomly generated sequence of similar length. I have been using the R packages "MotifDb" and "seqinr", but I couldn't find any package that does the statistical representation analysis. I have the following questions:
(1) Any suggestions for packages? Also, which statistical test would be the most appropriate for the representation analysis (rho, z-score, or something else)?
(2) Are there any packages that let you do the same analysis for TFs that lack a known, validated binding sequence? (My approach would be to first construct random DNA strings of variable length, but I am not sure how to go about it.)
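
To make the question concrete, here is a rough sketch of the kind of calculation I have in mind. It is written in plain Python only for illustration (I would want the R equivalent), and everything in it is an assumption on my part: the function names `cpg_rho`, `cpg_zscore` and `random_dna` are placeholders I made up, the 8-mer is just an example input, "rho" is taken to mean the observed/expected ratio f(CG) / (f(C) * f(G)), and the z-score is computed against shuffled copies of the same sequence, so the null model only preserves base composition.

```python
import random
from collections import Counter


def count_cpg(seq):
    """Count CG dinucleotides (overlapping windows) on the given strand."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")


def cpg_rho(seq):
    """Observed/expected CpG ratio: f(CG) / (f(C) * f(G)).

    Assumption: base frequencies come from the sequence itself. To compare
    against the genome instead, f(C) and f(G) would be genome-wide values.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    counts = Counter(seq)
    f_c = counts["C"] / n
    f_g = counts["G"] / n
    if f_c == 0 or f_g == 0:
        return float("nan")  # ratio is undefined without both C and G
    f_cg = count_cpg(seq) / (n - 1)  # n - 1 dinucleotide positions
    return f_cg / (f_c * f_g)


def cpg_zscore(seq, n_shuffles=1000, seed=0):
    """Empirical z-score of the CpG count against shuffled sequences.

    Shuffling keeps length and base composition but destroys base order.
    """
    rng = random.Random(seed)
    observed = count_cpg(seq)
    bases = list(seq.upper())
    null_counts = []
    for _ in range(n_shuffles):
        rng.shuffle(bases)
        null_counts.append(count_cpg("".join(bases)))
    mean = sum(null_counts) / n_shuffles
    var = sum((c - mean) ** 2 for c in null_counts) / (n_shuffles - 1)
    if var == 0:
        return float("nan")  # every shuffle gave the same count
    return (observed - mean) / var ** 0.5


def random_dna(length, gc_content=0.5, seed=None):
    """Random DNA string with a chosen expected GC fraction."""
    rng = random.Random(seed)
    at = (1 - gc_content) / 2
    gc = gc_content / 2
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


if __name__ == "__main__":
    motif = "TGACGTCA"  # example 8-mer, used only as illustrative input
    print(cpg_rho(motif))
    print(cpg_zscore(motif))
    print(random_dna(12, gc_content=0.6, seed=1))
```

My concern is that for sequences this short a single CpG changes the ratio a lot, which is why I am unsure whether rho, a z-score, or some other test is the right choice.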
Thanks a lot!