Forum: Is there ever a good reason to use a log transformation instead of Box-Cox?
2
7
23 months ago
Mensur Dlakic ★ 25k

I often deal with non-normal data distributions, and some downstream applications require transforming the data so that it at least resembles a normal distribution. The logarithm transformation (LT) seems to be used most often in the literature, followed by sqrt(Y) and sometimes 1/sqrt(Y). Questions about data standardization are often asked here as well, and the LT usually gets a mention.

I am wondering why this is the case. There is a power transformation called the Box-Cox transformation (BC) that has been around since the mid-1960s, so it is not a new kid on the block. It relies on finding a parameter lambda and then transforms the data as Y' = (Y^lambda - 1) / lambda for all lambda != 0. In the special case lambda = 0, ln(Y) is applied. It is fairly easy to calculate lambda, so I don't think that's the issue.
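
The piecewise definition above can be sketched in a few lines of NumPy. This is only an illustration, not a reference implementation: the helper name `box_cox` and the sample values are made up for the example, and the result is compared against `scipy.stats.boxcox` called with a fixed `lmbda`. Note that both functions assume strictly positive data.

```python
import numpy as np
from scipy.stats import boxcox

def box_cox(y, lam):
    """Box-Cox transform of strictly positive data (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox is only defined for strictly positive data")
    if lam == 0:
        return np.log(y)            # special case lambda = 0
    return (y**lam - 1.0) / lam     # general case lambda != 0

y = np.array([0.5, 1.0, 2.0, 10.0, 250.0])   # made-up positive values

# When lmbda is given, scipy.stats.boxcox returns only the transformed array.
manual = box_cox(y, 0.3)
reference = boxcox(y, lmbda=0.3)
print(np.allclose(manual, reference))

# As lambda approaches 0, the general formula approaches ln(Y).
near_zero = box_cox(y, 1e-8)
print(np.allclose(near_zero, np.log(y), atol=1e-5))
```

The second check is why ln(Y) is the natural choice for lambda = 0: it is the limit of the general formula, so the transformation is continuous in lambda.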

In the best case (when lambda=0), the LT will unskew the data to the same degree as BC. In all other cases the BC will do a better job, both visually and in terms of objective numbers (e.g., data skew close to zero).

I did a little exercise here by creating highly skewed data (left column below) and somewhat skewed data (right column below). The LT and BC were then applied to both datasets, and the resulting distributions are shown along with their skew. While the LT removes some of the skew (more so when the data is not too skewed to begin with), I hope it is clear that BC does a better job. As I said, lambda can be calculated rapidly and unambiguously, and that calculation is shown here for the left data column.

My question is: why do so many people still use the LT to get data closer to normality when they could be using BC? Is it ignorance? Is it laziness, because the LT is a smidge faster to calculate than BC? Or did someone do the actual analysis and show that in most cases lambda is in the [-0.5, 0.5] range? If so, that would explain the choices of 1/sqrt(Y), ln(Y) and sqrt(Y), because those are (up to a linear rescaling) the transformations for lambda = -0.5, lambda = 0 and lambda = 0.5, respectively.

Still, even if that last assumption is the reason, BC does a better job at minimal extra cost in time.
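
The link between the three classic transformations and specific lambda values can be checked numerically. The sketch below uses a made-up gamma-distributed sample and a fixed seed (both arbitrary choices for illustration). Because Box-Cox with a fixed lambda is a linear function of the corresponding simple transformation, the skew matches exactly, with a sign flip for lambda = -0.5 where the linear function is decreasing.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)                      # fixed seed, arbitrary choice
y = rng.gamma(shape=2.0, scale=3.0, size=10_000)    # made-up right-skewed positive sample

# lambda = 0.5: Box-Cox equals 2*sqrt(Y) - 2, an increasing linear function
# of sqrt(Y), so the skew is identical.
skew_bc_half = skew(boxcox(y, lmbda=0.5))
skew_sqrt = skew(np.sqrt(y))

# lambda = 0: Box-Cox is ln(Y) by definition.
skew_bc_zero = skew(boxcox(y, lmbda=0.0))
skew_log = skew(np.log(y))

# lambda = -0.5: Box-Cox equals 2 - 2/sqrt(Y), a decreasing linear function
# of 1/sqrt(Y), so the skew has the same magnitude but the opposite sign.
skew_bc_neg_half = skew(boxcox(y, lmbda=-0.5))
skew_inv_sqrt = skew(1.0 / np.sqrt(y))

print('lambda= 0.5: %.4f vs sqrt(Y):   %.4f' % (skew_bc_half, skew_sqrt))
print('lambda= 0.0: %.4f vs ln(Y):     %.4f' % (skew_bc_zero, skew_log))
print('lambda=-0.5: %.4f vs 1/sqrt(Y): %.4f' % (skew_bc_neg_half, skew_inv_sqrt))
```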

Here is the code if you want to run the above exercise on your own.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import spearmanr, skew, boxcox, boxcox_normmax, norm, probplot, boxcox_normplot

data_l = np.random.beta(0.5, 50, 100000)
data_l *= 10000
print('\n Skew for heavily biased data: %.4f' % skew(data_l) )
plt.figure(figsize = (8, 8))
sns.histplot(data_l, kde=True, stat='density')  # replaces the deprecated sns.distplot
plt.tight_layout()
plt.show()

data_r = np.random.beta(2, 100, 100000)
data_r *= 1000
print('\n Skew for biased data: %.4f' % skew(data_r) )
plt.figure(figsize = (8, 8))
sns.histplot(data_r, kde=True, stat='density')  # replaces the deprecated sns.distplot
plt.tight_layout()
plt.show()

tdata_l_log = np.log(data_l)
print('\n Skew for log-transformed heavily biased data: %.4f' % skew(tdata_l_log) )
lam_ = boxcox_normmax(data_l, method='mle')
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)  # axes for the lambda search plot
prob = boxcox_normplot(data_l, -1, 2, plot=ax, N=200)
ax.axvline(lam_, color='r')
ax.text(lam_+0.01,0.7,('%.4f' % lam_),rotation=0)
plt.tight_layout()
plt.savefig('lambda-search-1.png', dpi=150)
plt.show()
tdata_l = ((data_l**lam_)-1)/lam_
tdata_l, lam, alpha_interval = boxcox(x=data_l, alpha=0.05)
print('\n Skew, lambda and 95%% lambda confidence interval for Box-Cox-transformed heavily biased data: %.4f %.4f (%.4f, %.4f)' % (skew(tdata_l), lam, alpha_interval[0], alpha_interval[1]) )
tdata_r_log = np.log(data_r)
print('\n Skew for log-transformed biased data: %.4f' % skew(tdata_r_log) )
lamr_ = boxcox_normmax(data_r, method='mle')
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)  # axes for the lambda search plot
prob = boxcox_normplot(data_r, -1, 2, plot=ax, N=200)
ax.axvline(lamr_, color='r')
ax.text(lamr_+0.01,0.7,('%.4f' % lamr_),rotation=0)
plt.tight_layout()
plt.savefig('lambda-search-2.png', dpi=150)
plt.show()
tdata_r = ((data_r**lamr_)-1)/lamr_
tdata_r, lamr, alpha_r_interval = boxcox(x=data_r, alpha=0.05)
print('\n Skew, lambda and 95%% lambda confidence interval for Box-Cox-transformed biased data: %.4f %.4f (%.4f, %.4f)' % (skew(tdata_r), lamr, alpha_r_interval[0], alpha_r_interval[1]) )

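fig, ax = plt.subplots(3, 2, figsize=(8, 12))
# Left column: heavily skewed data; right column: moderately skewed data.
# Rows: original data, log transformation, Box-Cox transformation.
sns.histplot(data_l, kde=True, stat='density', ax=ax[0, 0])
ax[0, 0].set_xlabel('Large skew distribution (skew: %.4f)' % skew(data_l))
sns.histplot(data_r, kde=True, stat='density', ax=ax[0, 1])
ax[0, 1].set_xlabel('Smaller skew distribution (skew: %.4f)' % skew(data_r))
sns.histplot(tdata_l_log, kde=True, stat='density', ax=ax[1, 0])
ax[1, 0].set_xlabel('Log transformation (skew: %.4f)' % skew(tdata_l_log))
sns.histplot(tdata_l, kde=True, stat='density', ax=ax[2, 0])
ax[2, 0].set_xlabel('Box-Cox transformation ($\\lambda$: %.4f skew: %.4f)\n$\\lambda$ 95%% confidence interval: %.4f, %.4f' % (lam, skew(tdata_l), alpha_interval[0], alpha_interval[1]))
sns.histplot(tdata_r_log, kde=True, stat='density', ax=ax[1, 1])
ax[1, 1].set_xlabel('Log transformation (skew: %.4f)' % skew(tdata_r_log))
sns.histplot(tdata_r, kde=True, stat='density', ax=ax[2, 1])
ax[2, 1].set_xlabel('Box-Cox transformation ($\\lambda$: %.4f skew: %.4f)\n$\\lambda$ 95%% confidence interval: %.4f, %.4f' % (lamr, skew(tdata_r), alpha_r_interval[0], alpha_r_interval[1]))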
plt.tight_layout()
plt.savefig('skew-demo.png', dpi=150)
plt.show()

1

I've tried both of these and never had the result actually pass a Shapiro-Wilk test of normality. I wonder if there is an R function that can cherry-pick values from a non-normal distribution to form a normal distribution.

2

Neither transformation is guaranteed to bring the data to normality, but the idea is to bring it as close as possible to a normal distribution. That is usually enough for most applications.
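
One way to put a number on "as close as possible" is the Shapiro-Wilk W statistic itself, which approaches 1 as the sample looks more normal, rather than only the pass/fail p-value. The sketch below is an illustration on a made-up gamma-distributed sample with an arbitrary fixed seed; it is not taken from the data discussed in this thread.

```python
import numpy as np
from scipy.stats import boxcox, shapiro

rng = np.random.default_rng(0)                    # fixed seed, arbitrary choice
y = rng.gamma(shape=2.0, scale=3.0, size=500)     # made-up right-skewed positive sample

y_log = np.log(y)
y_bc, lam = boxcox(y)       # lambda estimated by maximum likelihood

# shapiro() returns (W statistic, p-value); W closer to 1 means closer to normal.
w_raw, p_raw = shapiro(y)
w_log, p_log = shapiro(y_log)
w_bc, p_bc = shapiro(y_bc)

print('raw:     W = %.4f, p = %.3g' % (w_raw, p_raw))
print('log:     W = %.4f, p = %.3g' % (w_log, p_log))
print('Box-Cox: W = %.4f, p = %.3g (lambda = %.4f)' % (w_bc, p_bc, lam))
```

Comparing W across transformations ranks them by closeness to normality even when every p-value falls below the chosen threshold.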

Below are two real distributions of RNAseq counts. I don't mean to mix this up with other transformations that may be more appropriate for this specific type of data, only to make this point: even when neither of the two transformed datasets is normally distributed, one can be closer to normality than the other.

1

Amusing historical tidbit on the Box-Cox naming:

George Box and Sir David Cox collaborated on one paper (Box, 1964). The story is that while Cox was visiting Box at Wisconsin, they decided they should write a paper together because of the similarity of their names (and that both are British). In fact, Professor Box is married to the daughter of Sir Ronald Fisher.

https://onlinestatbook.com/2/transformations/box-cox.html

0

By the way, there is also the Yeo-Johnson transformation, an extension of Box-Cox that can handle zero and negative data points. Yet another reason to favor power transformations over the log transformation.
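
Yeo-Johnson is available in SciPy as `scipy.stats.yeojohnson`. The sketch below applies it to a made-up sample that has been shifted so that some values are negative (the sample, shift and seed are arbitrary choices for illustration). On such data `np.log` would return NaN for the negative values and `scipy.stats.boxcox` would raise an error.

```python
import numpy as np
from scipy.stats import skew, yeojohnson

rng = np.random.default_rng(0)     # fixed seed, arbitrary choice
# Made-up right-skewed sample, shifted so that part of it is negative.
y = rng.gamma(shape=2.0, scale=3.0, size=10_000) - 2.0

# With no lmbda given, yeojohnson returns the transformed data and the
# lambda estimated by maximum likelihood.
transformed, lam = yeojohnson(y)

print('negative values present: %s' % bool((y < 0).any()))
print('skew before: %.4f' % skew(y))
print('Yeo-Johnson lambda: %.4f, skew after: %.4f' % (lam, skew(transformed)))
```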

4
23 months ago
Ram 40k

In my case, I was unaware of the Box-Cox transformation. Even now that I have been introduced to it, I have a few questions that make me balk a little:

1. How easy is it for a non-stats/data science person to wrap their heads around this concept?
2. How easy is it to go the other way (for log10/log2 it is as simple as 10**/2**)?
3. Combining the two above, how easy is it for biologists to look at two BC values and mentally picture how the underlying values compare?
2

In most of my applications it doesn't matter which power transformation is applied, as long as it can be reversed and it makes more sense for downstream data manipulations. I am not talking here about transformations that are done to squish the data into a smaller space for visualization (e.g., log-transformed fold changes). I am primarily talking about transforming data because, for example, some linear methods will not work otherwise.

• With some practice, I don't think this transformation is more difficult to understand than the log. Of course, that may not be universally true because people think differently.
• The inverse transformation is easy. When lambda=0, it is the same as for the log transformation: exp(Y'). When lambda != 0, it is exp(log(lambda*Y'+1)/lambda), which is the same as (lambda*Y'+1)^(1/lambda).
• The same answer as above: it is a matter of practice, but probably not as easy as log10 transformation.
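
The inverse described in the list above can be written as a small round-trip check. This is a minimal sketch: the helper name `inverse_box_cox`, the sample and the seed are made up for the example, and the result is compared with SciPy's own `scipy.special.inv_boxcox`.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

def inverse_box_cox(y_t, lam):
    """Undo a Box-Cox transform (hypothetical helper)."""
    y_t = np.asarray(y_t, dtype=float)
    if lam == 0:
        return np.exp(y_t)
    # Same as (lam * Y' + 1) ** (1 / lam)
    return np.exp(np.log(lam * y_t + 1.0) / lam)

rng = np.random.default_rng(0)                     # fixed seed, arbitrary choice
y = rng.gamma(shape=2.0, scale=3.0, size=1_000)    # made-up positive sample

y_t, lam = boxcox(y)     # forward transform, lambda estimated by maximum likelihood

recovered_manual = inverse_box_cox(y_t, lam)
recovered_scipy = inv_boxcox(y_t, lam)

print(np.allclose(recovered_manual, y))
print(np.allclose(recovered_scipy, y))
```

The estimated lambda has to be stored along with the transformed values, since the inverse cannot be computed without it.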
4
23 months ago

I think log is often used not only because it makes variance more homoskedastic in many cases (which is probably what we are interested in, rather than normality per se), but also because logs have very useful mathematical properties and are easy to manipulate. So if you want to calculate the ratio of two values, you can subtract them on the log scale, for example.

A deeper reason, partially related to the above, is that logs naturally represent phenomena that operate on a multiplicative scale. This is true of many things in biology: input signal to output activities in gene regulation often involve exponentials, and if you are interested in the affinity of one thing for another (e.g. ChIP-seq) then you are probably interested in the fold enrichment of one thing over another. That is, if I double my input (e.g. input DNA), I expect to double my output. This actually gets to one of the problems people have with logs: you can't take the log of 0. But I think zero is probably something that doesn't actually occur very often in biology. All things have some affinity for each other, even if it's very small, and all genes are switched on to some extent, even if it's very little. We often measure counts, which can be zero, but we are often doing so to estimate underlying parameters that are almost certainly not zero. Where the phenomena we are studying aren't naturally on a multiplicative scale, we tend to be using techniques borrowed from fields where they are.
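
The multiplicative-scale argument can be illustrated with a small simulation. The signal levels, noise model and seed below are made up for the example: each measurement is a true level multiplied by a log-normal factor, so the spread on the raw scale grows with the level while the spread on the log scale stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed, arbitrary choice
levels = np.array([10.0, 100.0, 1000.0])     # made-up "true" signal levels

# Multiplicative noise: true level times a log-normal factor.
noise = rng.lognormal(mean=0.0, sigma=0.3, size=(levels.size, 5_000))
measured = levels[:, None] * noise

sd_raw = measured.std(axis=1)             # grows roughly in proportion to the level
sd_log = np.log2(measured).std(axis=1)    # roughly the same at every level

for level, s_raw, s_log in zip(levels, sd_raw, sd_log):
    print('level %6.0f  sd raw: %8.2f  sd log2: %.3f' % (level, s_raw, s_log))

# A ratio on the raw scale is a difference on the log scale.
fold = np.log2(measured[2].mean() / measured[0].mean())
diff = np.log2(measured[2].mean()) - np.log2(measured[0].mean())
print(np.isclose(fold, diff))
```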

0

I don't know this for a fact, but my intuition is that greater normality would usually mean better homoscedasticity. I will research that a bit, but for now there is at least one article I found by searching quickly that says Box-Cox is better at reducing heteroscedasticity than log transformation. I agree, as I did with Ram, that log-transformed values may be more interpretable.

I will accept your premise about logs and exponentials in biological systems, even though there are many examples of a power-law dependency. The thing is, as best I can tell, people in all kinds of fields unrelated to biology also tend to favor the log transformation over Box-Cox, even when no interpretability of the numbers is required. As I said before, I am primarily talking about proper data analysis rather than visualizing the data.

It is my experience that people completely ignore the fact that Box-Cox will transform the data closer to normality, and maybe the reason is that they can get away with it because a log transformation is good enough.