Question: Gene Expression - Check Multivariate Normal In R?
9.0 years ago by
Ireland/ United Kingdom
Darren J. Fitzpatrick1.1k wrote:


I have a gene expression microarray dataset with dimensionality 427 x ~40,000.

I wish to test whether these data follow a multivariate normal distribution. In R, the mshapiro.test() function (Shapiro-Wilk test) in the mvnormtest library only permits vectors of at most 5000 entries.

I also attempted to use the squared Mahalanobis distance (on a QQ-plot it should follow a chi-squared distribution if the data are multivariate normal). However, this requires calculating a covariance matrix, which is not feasible for a data set this large (or wide).
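For readers unfamiliar with the check described above, here is a minimal sketch in base R. It uses simulated data and a small, hypothetical subset of 10 variables so that the covariance matrix is invertible; it does not solve the problem of 40,000 variables with only 427 samples, where the sample covariance matrix is singular.

```r
# Simulated stand-in for a small subset of genes (hypothetical data).
set.seed(1)
n <- 427   # number of samples, as in the question
p <- 10    # small subset of variables so that cov(x) can be inverted
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Squared Mahalanobis distance of each sample from the centroid.
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Under multivariate normality, d2 is approximately chi-squared with p
# degrees of freedom, so the points should lie close to the line y = x.
theo <- qchisq(ppoints(n), df = p)
qqplot(theo, d2,
       xlab = "Chi-squared quantiles",
       ylab = "Squared Mahalanobis distance")
abline(0, 1)
```

Systematic curvature away from the reference line, especially in the upper tail, would suggest a departure from multivariate normality in the chosen subset.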

Do you guys have any suggestions for alternative tests of multivariate normality for a large dataset, preferably but not necessarily with R?

Regards, S ;-)

R microarray statistics • 6.1k views
ADD COMMENTlink modified 8.7 years ago by Neilfws48k • written 9.0 years ago by Darren J. Fitzpatrick1.1k

I doubt that the calculation of SW makes sense for the whole data-set. I will try to explain this in an answer later.

ADD REPLYlink written 9.0 years ago by Michael Dondrup46k
9.0 years ago by
Alastair Kerr5.2k
The University of Edinburgh, UK
Alastair Kerr5.2k wrote:

With really big datasets, very small deviations from Gaussian can be statistically significant even though the t-test is tolerant of them. That said, the increased sensitivity of parametric tests may not matter with such large datasets. Hence the KS-test is usually my first choice with this sort of data.

But to answer your question, the D'Agostino-Pearson test could perhaps be used, see here

ADD COMMENTlink written 9.0 years ago by Alastair Kerr5.2k
9.0 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

Possibly you are thinking about doing a study like this?

I assume you have >400 microarrays (MA) with 40000 probesets (how many replicates per condition?)

Note that you should calculate the SW-test per probe-set/gene under the same experimental condition. Then you can get an estimate of which proportion of your genes have an error distribution significantly different from a normal.
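A minimal sketch of this per-gene approach in base R follows. The data are simulated, and the layout (genes in rows, replicates of a single condition in columns) and the Benjamini-Hochberg correction are assumptions made for illustration, not something specified in this thread.

```r
# Simulated expression matrix: genes in rows, replicates of ONE
# experimental condition in columns (hypothetical data and layout).
set.seed(2)
n_genes <- 200
n_rep   <- 30
expr <- matrix(rnorm(n_genes * n_rep), nrow = n_genes)

# Make the first 20 genes clearly non-normal (skewed) for illustration.
expr[1:20, ] <- matrix(rexp(20 * n_rep), nrow = 20)

# Shapiro-Wilk test per gene; each test sees only n_rep values,
# well within the 3..5000 limit of shapiro.test().
p_sw <- apply(expr, 1, function(g) shapiro.test(g)$p.value)

# Correct for multiple testing (assumption: Benjamini-Hochberg).
p_adj <- p.adjust(p_sw, method = "BH")

# Estimated proportion of genes whose distribution departs from normal.
prop_nonnormal <- mean(p_adj < 0.05)
prop_nonnormal
```

With several conditions, the same loop would be run separately within each condition, since pooling conditions mixes distributions with different means.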

The data on a single microarray are highly unlikely to be normal anyway, because the array contains genes with different expression levels: e.g. you will get a proportion of up-regulated, down-regulated and "0"-regulated genes. Even if each of these populations were normal in itself, the resulting mixture of Gaussians would not be.
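The mixture argument can be illustrated with a small simulation. This is only a sketch: the three component means, the mixing proportions and the sample size are arbitrary values chosen for illustration, not estimates from real array data.

```r
# Simulate one "array" as a mixture of three normal populations:
# down-regulated, unchanged and up-regulated genes (hypothetical values).
set.seed(3)
n <- 3000                                   # within shapiro.test()'s limit
component <- sample(1:3, n, replace = TRUE, prob = c(0.2, 0.6, 0.2))
mu <- c(-4, 0, 4)                           # component means (assumed)
array_values <- rnorm(n, mean = mu[component], sd = 1)

# Each component is exactly normal, but the pooled values are not.
sw <- shapiro.test(array_values)
sw$p.value

# The histogram shows three separate modes rather than one bell curve.
hist(array_values, breaks = 60, main = "Mixture of three Gaussians")
```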

What you then do with the results is another story. They can be used to determine whether a parametric test is applicable.

ADD COMMENTlink written 9.0 years ago by Michael Dondrup46k

If you have that many replicates, Wilcoxon's rank-sum test should have sufficient power for a two-sample comparison. Then you won't rely on any normality assumption.
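As a rough sketch of such a two-sample comparison per gene in base R (simulated data; the group labels, group sizes, effect size and the Benjamini-Hochberg correction are all assumptions made for illustration):

```r
# Simulated data: genes in rows, samples in columns, two conditions
# "A" and "B" (hypothetical labels and sizes).
set.seed(4)
n_genes <- 100
n_a <- 40
n_b <- 40
group <- factor(rep(c("A", "B"), times = c(n_a, n_b)))
expr <- matrix(rnorm(n_genes * (n_a + n_b)), nrow = n_genes)

# Shift the first 10 genes in condition B so that there is something to find.
expr[1:10, group == "B"] <- expr[1:10, group == "B"] + 1.5

# Wilcoxon rank-sum test per gene; no normality assumption is needed.
p_w <- apply(expr, 1, function(g) {
  wilcox.test(g[group == "A"], g[group == "B"])$p.value
})

# Adjust for multiple testing (assumption: Benjamini-Hochberg).
p_adj <- p.adjust(p_w, method = "BH")
sum(p_adj < 0.05)   # number of genes called different between conditions
```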

ADD REPLYlink written 9.0 years ago by Michael Dondrup46k

Yes, I definitely accept your reasoning. I wish to test this merely as a diagnostic before implementing further analyses using ranks alone. Personally, I find non-parametric methods very dissatisfying but alas, such is data.

ADD REPLYlink written 9.0 years ago by Darren J. Fitzpatrick1.1k
9.0 years ago by
Sydney, Australia
Neilfws48k wrote:

Good statistical advice in the answers above. For those looking for tests of multivariate normality without the restrictions of mvnormtest, here are some options from the CRAN Task View: Multivariate Statistics.

I've tested them on a small matrix (22283 x 6); note that the methods in the energy package can take a very long time to run.

ADD COMMENTlink written 9.0 years ago by Neilfws48k