Question: estimate population allele frequencies with EM (expectation-maximation) algorithm using R
gravatar for Ana
18 months ago by
Ana170 wrote:

Hello everyone, I have asked this question in Stack Exchange Bioinformatics forum, unfortunately I have not been able to get any help there. So, I am asking it here, maybe some R professionalists can help me to fix the final part of my R-code. I would really appreciate if anyone can help me to resolve the issue.

I am trying to estimate population allele frequency for CNVs (it can only result in genotyping CNV as a dominant marker) using EM algorithm manual in R. Below is the first few lines of my data-set. I have 40,000 genes and for each gene I have 2 rows (the first row is number of individuals that do not have the gene, __a, and the second row is the number of individuals that actually have the gene, __p). The present gene is dominant and can be either homozygote (pp) or heterozygote (pa).

                genes pop_1
    1 Ha1_00044004__a    0
    2 Ha1_00044004__p    8
    3 Ha1_00043269__a    1
    4 Ha1_00043269__p    7
    5 Ha1_00045379__a    0
    6 Ha1_00045379__p    8

I want to estimate the frequency of present genes in my population. I have used the following code, but I am stuck in the last part (I would like to repeat the process unless a condition is met and it is converged).

even_indexes<-seq(2,48216,2)  ## draw even rows which are counts for present genes
x.loadings <- data.frame(x=data_all[even_indexes,2])

mu.p <- 0.5 
mu.a <- 0.5

    ## step E  (loop through all genes)
            out_res <- NULL

             for (i in 1:nrow(x.loadings)){
                 #ABOexpect <- function(pp,mu.a,mu.o){
                     mu.p <- 0.5 
                     mu.a <- 0.5
                     npp <- x.loadings[i,] * (mu.p^2/(mu.p^2 + 2*(mu.p*mu.a)))
                     npa <- x.loadings[i,] * ((2*(mu.p*mu.a)/(mu.p^2 + 2*(mu.a*mu.p))))
                     N <- data.frame(npp,npa)
                     out_res<-rbind (out_res,N)

### combine the outcome of Estep with present and absent count 

            ### step M (loop through all genes)

            out_res_m <- NULL
            for (i in 1:nrow(data_expect)){
            #ABOmaximize <- function(npp,npa,naa) {
                n <-8
                p <- ((2*data_expect[i,1]) + data_expect[i,2])/ (2 * n)
                a <- ((2*data_expect[i,4]) + data_expect[i,2])/ (2 * n)
                L <- data.frame(p,a)
                out_res_m<-rbind (out_res_m,L)

## combine out_res_m with data expect to make a final data set

gene_iD.     Present_count absent_count app   naa  p  a
Ha1_00044004  8  0  2.666667  5.333333  0.66666667 0.3333333
Ha1_00043269  7  1  2.333333  4.666667  0.58333333 0.4166667
Ha1_00045379  8  0  2.666667  5.333333  0.66666667 0.3333333

The final part that I am stuck is the iteration part, repeat the process until convergence. If I had only one gene I could use something like this :

   iter <- 1
    diff <- 1
    tol <- 0.0001 ## tolerance

    while (diff > tol) {
          E <- ABOexpect(np,mu.p,mu.a)
          M <- ABOmaximize(E$pp,E$pa,np)
          diff <- abs(M$mu.p - mu.p) + abs(M$mu.a - mu.a)
          mu.a <- M$mu.p
          mu.o <- M$mu.a
          cat(sprintf("iter:%d, diff:%.2f\n",iter,diff))
          iter <- iter + 1
    cat(sprintf("mu_a=%.4f, mu_o=%.4f\n",mu.a,mu.o))

But my problem is that I have 40,000 genes. I do not know how can I modify the code above to get my desired out put!

em algorithm R • 357 views
ADD COMMENTlink written 18 months ago by Ana170
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1029 users visited in the last hour