Question: Is it okay to apply ANOVA to log2-transformed data?
morovatunc400 wrote:

Dear all,

I would like to know if I can apply ANOVA to my log2-transformed data.

We have SNV + INDEL mutation count data from ICGC, and we are trying to show that breast cancer has a higher mutation rate in specific regions. Because the total number of mutations in the data is too dispersed, we apply a log2 transformation. I would therefore like to know whether it is okay to apply ANOVA in this case, given that all the mean and variance calculations are different on the log2 scale.

My aim is to show that the means are not the same across cancers, and then to run post hoc tests for the individual comparisons with t.test.

Best regards,

Tunc.

statistics anova • 1.3k views
modified 3.2 years ago by Lemire460 • written 3.2 years ago by morovatunc400

ANOVA has three assumptions (this is from wiki!):

• Independence of observations
• Normal distribution of the residuals
• Equal variance across groups

If your log2-transformed data comply with these assumptions, you can try ANOVA.
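
As a rough illustration of the point above, here is a minimal Python sketch that log2-transforms some counts, compares the per-group variances as a crude equal-variance check, and computes the one-way ANOVA F statistic by hand. The counts and group names are made up for the example, and the function name `one_way_anova_f` is my own. In practice you would more likely use `scipy.stats.shapiro`, `scipy.stats.levene` and `scipy.stats.f_oneway`, which also return p-values.

```python
import math
from statistics import mean, variance


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical mutation counts per sample, one list per cancer type.
counts = {
    "breast": [120, 340, 95, 410, 230],
    "lung": [80, 60, 150, 45, 110],
    "colon": [30, 75, 20, 55, 40],
}
log2_counts = {name: [math.log2(c) for c in values] for name, values in counts.items()}

# Crude equal-variance check on the log2 scale: ratio of largest to smallest variance.
group_variances = {name: variance(values) for name, values in log2_counts.items()}
variance_ratio = max(group_variances.values()) / min(group_variances.values())

f_stat, df_b, df_w = one_way_anova_f(list(log2_counts.values()))
print(f"variance ratio (max/min) on log2 scale: {variance_ratio:.2f}")
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A variance ratio far from 1 would be a hint that the equal-variance assumption is doubtful even after the transformation. The sketch stops at the F statistic, so the p-value still has to come from an F distribution with the printed degrees of freedom.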

lukas.kall20 wrote:

ANOVA compares the variance between sample groups with the variance within sample groups, which roughly translates to comparing the difference in means between the groups against assumed additive noise within each group. However, if you log your data, your error model becomes different. At that point you are (again roughly) comparing the difference in geometric means between your sample groups against assumed multiplicative noise. I do not know whether SNV + INDEL mutation count data is better modeled by multiplicative or additive noise. However, you can easily find out by plotting the in-group variance as a function of the mean intensity of the sample group. If there is a linear correlation between the two, I would say that the error is multiplicative. If the variance does not depend on intensity, the error is additive.
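
A minimal sketch of that diagnostic in Python, printing a table instead of drawing the plot. The counts are hypothetical and `group_summary` is a name I made up. I report the standard deviation and the coefficient of variation (SD / mean) alongside the variance, because under purely multiplicative noise it is the standard deviation that scales with the mean, so a roughly constant CV across groups points in that direction.

```python
from statistics import mean, stdev, variance


def group_summary(groups):
    """Return one (name, mean, variance, sd, cv) tuple per group."""
    rows = []
    for name, values in groups.items():
        m = mean(values)
        sd = stdev(values)
        rows.append((name, m, variance(values), sd, sd / m))
    return rows


# Hypothetical raw (untransformed) mutation counts, one list per cancer type.
counts = {
    "breast": [120, 340, 95, 410, 230],
    "lung": [80, 60, 150, 45, 110],
    "colon": [30, 75, 20, 55, 40],
}

print(f"{'group':<8}{'mean':>10}{'variance':>12}{'sd':>10}{'cv':>8}")
for name, m, var, sd, cv in group_summary(counts):
    print(f"{name:<8}{m:>10.1f}{var:>12.1f}{sd:>10.1f}{cv:>8.2f}")
```

If the variance column stays roughly flat while the mean changes, additive noise is the more plausible model. If the spread grows with the mean, the log scale is probably the better place to run the test.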

I think Lukas's caution is right, and you should test a model's assumptions before you apply it.

I don't know ANOVA well, but I would expect that, like the t-test, it also assumes that the observed variance will itself follow some sort of distribution. In a t-test, the variance measured from repeated draws from a normal distribution follows a scaled chi-square distribution. This distribution is used together with the normal distribution to build the t-distribution. If you log-transform the data, the variance may no longer follow this distribution, and your false positive rate can be off from your p-value.

In the t-test I found that I actually got a more accurate answer on data that followed a lognormal distribution without log-transforming the data, as long as the tail wasn't ridiculous. The assumption of normality is pretty robust, but with a small sample size the error model is really important.
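
One way to check this kind of claim for your own design is a small simulation. The sketch below is only an illustration under assumptions I picked myself: two groups of 5 samples, both drawn from the same lognormal distribution (so every rejection is a false positive), a pooled-variance t-test, and a hard-coded 5% critical value for 8 degrees of freedom. It counts how often the test rejects on the raw data and on the log2-transformed data.

```python
import math
import random
from statistics import mean, variance

N_PER_GROUP = 5      # samples per group; gives 2 * 5 - 2 = 8 degrees of freedom
T_CRIT_DF8 = 2.306   # two-sided 5% critical value of Student's t with 8 df


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (equal-variance t-test)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def false_positive_rates(n_sims=4000, sigma=1.0, seed=1):
    """Return (raw_rate, log_rate): rejection rates when the null is true."""
    rng = random.Random(seed)
    raw_hits = log_hits = 0
    for _ in range(n_sims):
        # Both groups come from the same lognormal distribution.
        a = [rng.lognormvariate(0.0, sigma) for _ in range(N_PER_GROUP)]
        b = [rng.lognormvariate(0.0, sigma) for _ in range(N_PER_GROUP)]
        if abs(pooled_t(a, b)) > T_CRIT_DF8:
            raw_hits += 1
        log_a = [math.log2(x) for x in a]
        log_b = [math.log2(x) for x in b]
        if abs(pooled_t(log_a, log_b)) > T_CRIT_DF8:
            log_hits += 1
    return raw_hits / n_sims, log_hits / n_sims


raw_rate, log_rate = false_positive_rates()
print(f"false positive rate on raw data:  {raw_rate:.3f}")
print(f"false positive rate on log2 data: {log_rate:.3f}")
```

This only looks at the false positive rate under one null scenario. It says nothing about power, and changing `sigma`, the group size (together with the critical value) or the distribution may change the picture.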

Lemire460 wrote:

If you are concerned about the underlying parametric assumptions, you could try Kruskal-Wallis, the non-parametric version of the ANOVA. It's rank-based, meaning you can apply it to the non-transformed data.
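
To illustrate why the transformation does not matter here, below is a minimal hand-rolled sketch of the Kruskal-Wallis H statistic in Python (with average ranks for ties and the usual tie correction). The counts are hypothetical and the function names are my own. In practice `scipy.stats.kruskal` does the same job and also returns a p-value. Because log2 is monotone, the ranks, and therefore H, are identical on the raw and the log2 scale.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard tie correction."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n) over tied blocks.
    tie_counts = {}
    for x in pooled:
        tie_counts[x] = tie_counts.get(x, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in tie_counts.values()) / (n ** 3 - n)
    return h / correction


# Hypothetical mutation counts per sample, one list per cancer type.
raw = [[120, 340, 95, 410, 230], [80, 60, 150, 45, 110], [30, 75, 20, 55, 40]]
logged = [[math.log2(x) for x in g] for g in raw]

h_raw = kruskal_wallis_h(raw)
h_log = kruskal_wallis_h(logged)
print(f"H on raw counts:  {h_raw:.4f}")
print(f"H on log2 counts: {h_log:.4f}")

# With exactly 3 groups, H is compared to a chi-square with 2 df,
# whose upper tail probability is exp(-H / 2). This shortcut is only valid for df = 2.
approx_p = math.exp(-h_raw / 2)
print(f"approximate p-value (3 groups only): {approx_p:.4f}")
```

The chi-square approximation is known to be rough for very small groups, so treat the p-value from such a sketch as indicative only.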