Question: Gaussian mixture model in R
1
LJ210 wrote:

I have some sets of data(data was transformed to log-ratio),and i want to do data normalization like an article wrote: " A 2-component Gaussian mixture model-based normalization algorithm was used to achieve this normalization.The two Gaussians(μ1,σ1) and(μ2,σ2) for a sample i were fitted and used in the normalization process as follows: the mode mi of the log-ratio distribution was determined for each sample using kernel density estimation with a Gaussian kernel and Shafer-Jones bandwidth. Then A two-component Gaussian mixture model was then fit with the mean of both Gaussians constrained to be 𝑚i, i.e., 𝜇1i = 𝜇2i = 𝑚i. The Gaussian with the smaller estimated standard deviation 𝜎𝑖 = min⁡(𝜎̂1𝑖, 𝜎̂2𝑖) was used to normalize the sample. The sample was standardized using N(mi,σi) by subtracting the mean mi from each gene and dividing by the standard deviation σi. Constrained fitting of mixture models was implemented using the mixtools R package."

And i'm not good at statistics stuff ,So can anyone be kind to teach me how to write the R code to achieve the normalization effect just like the article wrote? Thanks in advance.

modified 3.7 years ago by Biostar ♦♦ 20 • written 4.1 years ago by LJ210
1

Does not the paper comes with a code snippet that you can use? it clearly comes with the method of incorporating the mix modeling , the package is a good way to start. Take a look at the package and try to play with the examples and see the mixture plots and how the fit model is derived. You can also see how `mclust` is used for the same purpose of mixture model fit for clustering here.

This blog post is also having a nice and lucid explanation you can take a look at it and alternatively this link also serves a nice code snippet and explanation of when and why to use with `mixtools` and apply it for clustering.

1

Thanks first.There is no code snippet in the article.And I just want to do the normalization by 2-component Gaussian mixture model-based algorithm as the article wrote,So how to finish the following code : library(mixtools) a<-rnorm(1000) mi<-density(a)#### kernel density estimation with a Gaussian kernel and Shafer-Jones bandwidth ??? out<-mvnormalmixEM(a,k=2,lambda=NULL,mu =?,sigma = ?) ####like the article said "μ1i =μ2i =mi" ,so the'mu'should be set to equal? But the mi is a vector of length 1000,SO how do i set the 'mu' and 'sigma'? And at last ,i hope to get the result that 2 component (μ1i =μ2i =mi,sigma1≠sigma2),and i can choose the smaller sigma to normalize the sample.

3
Jean-Karim Heriche22k wrote:

Try something like this:

``````library(stats) # for bw.SJ() and density()
library(modeest) # for mlv()
library(mixtools) # for mvnormalmixEM()
# Assume data for sample i is in D
b<-bw.SJ(D,...) # Find bandwith with Shafer-Jones method
d<-density(data, bw=b, kernel="gaussian") # Gaussian kernel density estimate
M<-mlv(d,method="Parzen") # Find the mode of a univariate distribution
mode<-M\$M
model<-normalmixEM(data,mu=c(mode,mode),sigma=NULL)
sigma = min(model\$sigma)
``````
1

Thanks for the reply.And some questions for you: 1)M<-mlv(d,method="Parzen") mode<-M\$M #### What dou you mean by this ? and the article said "two-component Gaussian mixture model was then fit with the mean of both Gaussians constrained to be mi(μ1i =μ2i =mi) 2)I just ran your code,and i used the 2-component Gaussian mixture model,so k was set to 2,and the result is : summary(model) summary of normalmixEM object: comp1 comp2 lambda 0.9718554 0.0281446 mu -0.0263047 0.0768719 sigma 1.0001167 0.0764447 loglik at estimate:-1402.182 SO the question is why mu1≠mu2≠mode when we set the mu=c(mode,mode)?

1

mlv() returns an mlv object in which the value of the mode is in \$M. To constrain the means in normalmixEM, you need mean.constr=c(mode,mode). Always check the functions docs.

1

ok,i get it,thanks. And(1)is this right that i set the k to 2 for the situation in the article? other parameters,just like lambda,epsilon,maxit...? or just defalut when there is no special requirement? (2)The article wrote:"The sample was standardized using N(mi,σi) by subtracting the mean mi from each gene and dividing by the standard deviation σi",and the mi is the mode of the sample i data distribution,so mi should be a single value,so why the author wrote"by subtracting the mean mi from each gene"?confused..

1

1- If by k, you're referring to the number of components in the Gaussian mixture then yes

2- As I understand what you describe, once the model is computed (one model per sample), the Gaussian with the smallest sigma is selected for normalization i.e. its mean and sigma are used to standardize values for each gene in the sample.

1

i just tried the code:
a<-rnorm(1000) ### example data
b<-bw.SJ(a)
d<-density(a, bw=b, kernel="gaussian") # Gaussian kernel density estimate
M<-mlv(d,method="Parzen") # Find the mode of a univariate distribution
mode<-M\$M
And i found the mode sometimes not a single value,maybe 2 or more,so which value should i choose for mu=c(mode,mode) in normalmixEM?

1

Plot the density estimate with

``````plot(d)
``````

to check if the density estimate has more than one peak or a pronounced shoulder.

Actually, since we're dealing with a Gaussian, we can also do

``````mode<-d\$x[which.max(d\$y)]
``````
1

Thanks a million for your nice help. It helps me a lot.