Question

estimate population allele frequencies with EM (expectation-maximation) algorithm using R

0

Entering edit mode

7.2 years ago

Ana ▴ 200

Hello everyone, I have asked this question in Stack Exchange Bioinformatics forum, unfortunately I have not been able to get any help there. So, I am asking it here, maybe some R professionalists can help me to fix the final part of my R-code. I would really appreciate if anyone can help me to resolve the issue.

I am trying to estimate population allele frequency for CNVs (it can only result in genotyping CNV as a dominant marker) using EM algorithm manual in R. Below is the first few lines of my data-set. I have 40,000 genes and for each gene I have 2 rows (the first row is number of individuals that do not have the gene, __a, and the second row is the number of individuals that actually have the gene, __p). The present gene is dominant and can be either homozygote (pp) or heterozygote (pa).

head(data_all)
                genes pop_1
    1 Ha1_00044004__a    0
    2 Ha1_00044004__p    8
    3 Ha1_00043269__a    1
    4 Ha1_00043269__p    7
    5 Ha1_00045379__a    0
    6 Ha1_00045379__p    8

I want to estimate the frequency of present genes in my population. I have used the following code, but I am stuck in the last part (I would like to repeat the process unless a condition is met and it is converged).

even_indexes<-seq(2,48216,2)  ## draw even rows which are counts for present genes
x.loadings <- data.frame(x=data_all[even_indexes,2])

mu.p <- 0.5 
mu.a <- 0.5

    ## step E  (loop through all genes)
            out_res <- NULL

             for (i in 1:nrow(x.loadings)){
                 #ABOexpect <- function(pp,mu.a,mu.o){
                     mu.p <- 0.5 
                     mu.a <- 0.5
                     npp <- x.loadings[i,] * (mu.p^2/(mu.p^2 + 2*(mu.p*mu.a)))
                     npa <- x.loadings[i,] * ((2*(mu.p*mu.a)/(mu.p^2 + 2*(mu.a*mu.p))))
                     N <- data.frame(npp,npa)
                     out_res<-rbind (out_res,N)
             }

### combine the outcome of Estep with present and absent count 
        data_expect<-cbind(out_res,count_present,count_absent)
        colnames(data_expect)<-c("npp","npa","count_present","count_absent")


            ### step M (loop through all genes)

            out_res_m <- NULL
            for (i in 1:nrow(data_expect)){
            #ABOmaximize <- function(npp,npa,naa) {
                n <-8
                p <- ((2*data_expect[i,1]) + data_expect[i,2])/ (2 * n)
                a <- ((2*data_expect[i,4]) + data_expect[i,2])/ (2 * n)
                L <- data.frame(p,a)
                out_res_m<-rbind (out_res_m,L)
                }

## combine out_res_m with data expect to make a final data set
data_max<-cbind(out_res_m,data_expect)

head(data_mx)
gene_iD.     Present_count absent_count app   naa  p  a
Ha1_00044004  8  0  2.666667  5.333333  0.66666667 0.3333333
Ha1_00043269  7  1  2.333333  4.666667  0.58333333 0.4166667
Ha1_00045379  8  0  2.666667  5.333333  0.66666667 0.3333333
.
.
.

The final part that I am stuck is the iteration part, repeat the process until convergence. If I had only one gene I could use something like this :

   iter <- 1
    diff <- 1
    tol <- 0.0001 ## tolerance

    while (diff > tol) {
          E <- ABOexpect(np,mu.p,mu.a)
          M <- ABOmaximize(E$pp,E$pa,np)
          diff <- abs(M$mu.p - mu.p) + abs(M$mu.a - mu.a)
          mu.a <- M$mu.p
          mu.o <- M$mu.a
          cat(sprintf("iter:%d, diff:%.2f\n",iter,diff))
          iter <- iter + 1
    }
    cat(sprintf("mu_a=%.4f, mu_o=%.4f\n",mu.a,mu.o))

But my problem is that I have 40,000 genes. I do not know how can I modify the code above to get my desired out put!

R EM algorithm • 1.1k views

ADD COMMENT • link 7.2 years ago by Ana ▴ 200