I often deal with non-normal data distributions, and for some downstream applications the data needs to be transformed to at least resemble a normal distribution. The logarithm transformation (LT) seems to be used most often in the literature, followed by `sqrt(Y)` and sometimes `1/sqrt(Y)`. Questions are often asked here about data standardization as well, and the LT usually gets a mention.

I am wondering why this is the case. There is a power transformation called the Box-Cox transformation (BC) that has been around since the mid-1960s, so it is not a new kid on the block. This transformation relies on finding a factor lambda, and then transforms the Y data by a simple `Y' = ((Y^lambda) - 1) / lambda` for all lambda != 0. For the special case lambda = 0, `ln(Y)` is applied instead. It is fairly easy to calculate lambda, so I don't think that's the issue.
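As a minimal sketch of the formula and the lambda fit (using SciPy's `boxcox_normmax`; the data here is made up for illustration):

```python
import numpy as np
from scipy.stats import boxcox_normmax, skew

def box_cox(y, lam):
    """Box-Cox transform: ((y**lam) - 1) / lam, or ln(y) when lam == 0."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y**lam - 1.0) / lam

rng = np.random.default_rng(0)
y = rng.beta(0.5, 50, 10_000) * 10_000 + 1e-9  # strictly positive, right-skewed

lam = boxcox_normmax(y, method='mle')  # maximum-likelihood estimate of lambda
ty = box_cox(y, lam)
print('lambda = %.4f, skew before = %.4f, after = %.4f'
      % (lam, skew(y), skew(ty)))
```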

In the best case (when the optimal lambda is 0), the LT will unskew the data to the same degree as the BC. In all other cases the BC will do a better job, both visually and in terms of objective numbers (*e.g.*, a skew closer to zero).

I did a little exercise here by creating highly skewed data in the left column and somewhat skewed data in the right column below. Then the LT and BC were applied to both datasets, and the resulting distributions are shown along with their skew. While the LT removes some of the skew (and does better when the data is not too skewed to begin with), I hope it is clear that the BC does a better job.

As I said, lambda can be calculated fairly rapidly and unambiguously, and such a calculation is shown here for the left data column.

My question is: why do so many people still use the LT to get data closer to normality when they could be using the BC? Is it ignorance? Is it laziness, because the LT is a smidge faster to calculate than the BC? Or did someone do the actual analysis and show that in most cases lambda falls in the `[-0.5, 0.5]` range? If so, that would explain the common choices: `1/sqrt(Y)` corresponds to lambda = -0.5, `ln(Y)` to lambda = 0, and `sqrt(Y)` to lambda = 0.5.
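(A quick numerical check of that correspondence, on made-up data: at lambda = 0.5 the Box-Cox output is `2*(sqrt(Y) - 1)`, an affine function of `sqrt(Y)`, so the two are equivalent as far as skew and normality are concerned; likewise lambda = -0.5 gives `2*(1 - 1/sqrt(Y))`.)

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.gamma(2.0, 1.0, 1000)  # arbitrary positive data

bc_half = (y**0.5 - 1) / 0.5          # Box-Cox with lambda = 0.5
assert np.allclose(bc_half, 2 * (np.sqrt(y) - 1))

bc_neg_half = (y**-0.5 - 1) / -0.5    # Box-Cox with lambda = -0.5
assert np.allclose(bc_neg_half, 2 * (1 - 1 / np.sqrt(y)))
```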

Still, even if that last assumption is the reason, the BC does a better job for minimal extra time.

Here is the code if you want to run the above exercise on your own.

```
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew, boxcox, boxcox_normmax, boxcox_normplot

np.random.seed(0)  # reproducible draws

# Heavily skewed data (the "left column" in the figure).
data_l = np.random.beta(0.5, 50, 100000) * 10000
print('\n Skew for heavily biased data: %.4f' % skew(data_l))
plt.figure(figsize=(8, 8))
sns.histplot(data_l, kde=True)
plt.tight_layout()
plt.show()

# Somewhat skewed data (the "right column").
data_r = np.random.beta(2, 100, 100000) * 1000
print('\n Skew for biased data: %.4f' % skew(data_r))
plt.figure(figsize=(8, 8))
sns.histplot(data_r, kde=True)
plt.tight_layout()
plt.show()

# Log transformation of the heavily skewed data.
tdata_l_log = np.log(data_l)
print('\n Skew for log-transformed heavily biased data: %.4f' % skew(tdata_l_log))

# Lambda search for the heavily skewed data.
lam_ = boxcox_normmax(data_l, method='mle')
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111)
boxcox_normplot(data_l, -1, 2, plot=ax, N=200)
ax.axvline(lam_, color='r')
ax.text(lam_ + 0.01, 0.7, '%.4f' % lam_)
plt.tight_layout()
plt.savefig('lambda-search-1.png', dpi=150)
plt.show()

# boxcox() applies Y' = ((Y**lam) - 1) / lam with the MLE lambda and, with
# alpha given, also returns a confidence interval for lambda.
tdata_l, lam, alpha_interval = boxcox(x=data_l, alpha=0.05)
print('\n Skew, lambda and 95%% lambda confidence interval for '
      'Box-Cox-transformed heavily biased data: %.4f %.4f (%.4f, %.4f)'
      % (skew(tdata_l), lam, alpha_interval[0], alpha_interval[1]))

# Same steps for the less skewed data.
tdata_r_log = np.log(data_r)
print('\n Skew for log-transformed biased data: %.4f' % skew(tdata_r_log))
lamr_ = boxcox_normmax(data_r, method='mle')
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111)
boxcox_normplot(data_r, -1, 2, plot=ax, N=200)
ax.axvline(lamr_, color='r')
ax.text(lamr_ + 0.01, 0.7, '%.4f' % lamr_)
plt.tight_layout()
plt.savefig('lambda-search-2.png', dpi=150)
plt.show()
tdata_r, lamr, alpha_r_interval = boxcox(x=data_r, alpha=0.05)
print('\n Skew, lambda and 95%% lambda confidence interval for '
      'Box-Cox-transformed biased data: %.4f %.4f (%.4f, %.4f)'
      % (skew(tdata_r), lamr, alpha_r_interval[0], alpha_r_interval[1]))

# Raw, log-transformed and Box-Cox-transformed distributions side by side.
fig, ax = plt.subplots(3, 2, figsize=(8, 12))
sns.histplot(data_l, kde=True, ax=ax[0][0])
ax[0][0].set_xlabel('Large skew distribution (skew: %.4f)' % skew(data_l))
sns.histplot(data_r, kde=True, ax=ax[0][1])
ax[0][1].set_xlabel('Smaller skew distribution (skew: %.4f)' % skew(data_r))
sns.histplot(tdata_l_log, kde=True, ax=ax[1][0])
ax[1][0].set_xlabel('Log transformation (skew: %.4f)' % skew(tdata_l_log))
sns.histplot(tdata_r_log, kde=True, ax=ax[1][1])
ax[1][1].set_xlabel('Log transformation (skew: %.4f)' % skew(tdata_r_log))
sns.histplot(tdata_l, kde=True, ax=ax[2][0])
ax[2][0].set_xlabel(r'Box-Cox transformation ($\lambda$: %.4f skew: %.4f)'
                    '\n'
                    r'$\lambda$ 95%% confidence interval: %.4f, %.4f'
                    % (lam, skew(tdata_l), alpha_interval[0], alpha_interval[1]))
sns.histplot(tdata_r, kde=True, ax=ax[2][1])
ax[2][1].set_xlabel(r'Box-Cox transformation ($\lambda$: %.4f skew: %.4f)'
                    '\n'
                    r'$\lambda$ 95%% confidence interval: %.4f, %.4f'
                    % (lamr, skew(tdata_r), alpha_r_interval[0], alpha_r_interval[1]))
plt.tight_layout()
plt.savefig('skew-demo.png', dpi=150)
plt.show()
```

I've tried both of these and never had the result actually pass a Shapiro-Wilk test of normality. I wonder if there is an R function that can cherry-pick values from a non-normal distribution to form a normal distribution.

Neither transformation is guaranteed to bring the data to normality, but the idea is to bring it as close as possible to a normal distribution. That is usually enough for most applications.
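As a quick sketch of that point (the data and seed here are made up, not from the exercise above): with a large skewed sample, the raw data fails Shapiro-Wilk decisively, while the Box-Cox-transformed data comes much closer, even if it does not always clear the 0.05 bar.

```python
import numpy as np
from scipy.stats import shapiro, boxcox

rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # heavily right-skewed

p_raw = shapiro(data).pvalue
p_bc = shapiro(boxcox(data)[0]).pvalue  # boxcox returns (transformed, lambda)

print('raw p = %.2g, Box-Cox p = %.2g' % (p_raw, p_bc))
# The transformed data is far closer to normal, though a p-value above 0.05
# is never guaranteed.
```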

Below are two real distributions of RNAseq counts. I don't mean to conflate this with other transformations that may be more appropriate for this specific type of data; the point is just that even when neither transformed dataset is normally distributed, one can be closer to normality than the other.

Amusing historical tidbit on the Box-Cox naming:

https://onlinestatbook.com/2/transformations/box-cox.html

By the way, there is a Yeo-Johnson transformation, an extension of Box-Cox that can also handle zero and negative data points. Yet another reason to favor this family over the log transformation.
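A minimal sketch with SciPy's `yeojohnson` (the data here is made up for illustration): unlike log and Box-Cox, it accepts any real-valued input.

```python
import numpy as np
from scipy.stats import yeojohnson, skew

rng = np.random.default_rng(7)
# Skewed data containing negative values, where log and Box-Cox both fail.
data = rng.exponential(scale=2.0, size=10_000) - 1.0

ty, lam = yeojohnson(data)  # MLE lambda, works for negative values too
print('lambda = %.4f, skew before = %.4f, after = %.4f'
      % (lam, skew(data), skew(ty)))
```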