estimate population allele frequencies with EM (expectation-maximation) algorithm using R
0
0
Entering edit mode
3.0 years ago
Ana ▴ 180

Hello everyone, I have asked this question in Stack Exchange Bioinformatics forum, unfortunately I have not been able to get any help there. So, I am asking it here, maybe some R professionalists can help me to fix the final part of my R-code. I would really appreciate if anyone can help me to resolve the issue.

I am trying to estimate population allele frequency for CNVs (it can only result in genotyping CNV as a dominant marker) using EM algorithm manual in R. Below is the first few lines of my data-set. I have 40,000 genes and for each gene I have 2 rows (the first row is number of individuals that do not have the gene, __a, and the second row is the number of individuals that actually have the gene, __p). The present gene is dominant and can be either homozygote (pp) or heterozygote (pa).

head(data_all)
genes pop_1
1 Ha1_00044004__a    0
2 Ha1_00044004__p    8
3 Ha1_00043269__a    1
4 Ha1_00043269__p    7
5 Ha1_00045379__a    0
6 Ha1_00045379__p    8


I want to estimate the frequency of present genes in my population. I have used the following code, but I am stuck in the last part (I would like to repeat the process unless a condition is met and it is converged).

even_indexes<-seq(2,48216,2)  ## draw even rows which are counts for present genes

mu.p <- 0.5
mu.a <- 0.5

## step E  (loop through all genes)
out_res <- NULL

#ABOexpect <- function(pp,mu.a,mu.o){
mu.p <- 0.5
mu.a <- 0.5
N <- data.frame(npp,npa)
out_res<-rbind (out_res,N)
}

### combine the outcome of Estep with present and absent count
data_expect<-cbind(out_res,count_present,count_absent)
colnames(data_expect)<-c("npp","npa","count_present","count_absent")

### step M (loop through all genes)

out_res_m <- NULL
for (i in 1:nrow(data_expect)){
#ABOmaximize <- function(npp,npa,naa) {
n <-8
p <- ((2*data_expect[i,1]) + data_expect[i,2])/ (2 * n)
a <- ((2*data_expect[i,4]) + data_expect[i,2])/ (2 * n)
L <- data.frame(p,a)
out_res_m<-rbind (out_res_m,L)
}

## combine out_res_m with data expect to make a final data set
data_max<-cbind(out_res_m,data_expect)

gene_iD.     Present_count absent_count app   naa  p  a
Ha1_00044004  8  0  2.666667  5.333333  0.66666667 0.3333333
Ha1_00043269  7  1  2.333333  4.666667  0.58333333 0.4166667
Ha1_00045379  8  0  2.666667  5.333333  0.66666667 0.3333333
.
.
.


The final part that I am stuck is the iteration part, repeat the process until convergence. If I had only one gene I could use something like this :

   iter <- 1
diff <- 1
tol <- 0.0001 ## tolerance

while (diff > tol) {
E <- ABOexpect(np,mu.p,mu.a)
M <- ABOmaximize(E$pp,E$pa,np)
diff <- abs(M$mu.p - mu.p) + abs(M$mu.a - mu.a)
mu.a <- M$mu.p mu.o <- M$mu.a
cat(sprintf("iter:%d, diff:%.2f\n",iter,diff))
iter <- iter + 1
}
cat(sprintf("mu_a=%.4f, mu_o=%.4f\n",mu.a,mu.o))


But my problem is that I have 40,000 genes. I do not know how can I modify the code above to get my desired out put!

R EM algorithm • 591 views