I have noticed that gene expression data (e.g. sequencing data) are usually log2-transformed rather than, say, log10-transformed. Why is the log2 transformation so commonly used instead of other transformations? I would like to understand the basic theory behind applying a log2 transformation to gene expression data.
When a log transformation is used, the main rationale is heteroskedasticity: on many platforms (arrays, etc.) the variance of an expression measurement depends on the expression level. Log-transforming reduces this dependence, so the data are better behaved for statistical testing. As pointed out by russhh, the choice of base 2 is just a practical one. Changing the base only rescales the values by a constant factor (log10(x) = log2(x) * log10(2)), and with base 2 a difference of one unit corresponds to a doubling, which makes fold changes easy to read. Many other transformations can be applied to expression data, and the "best" one likely depends on your measurement platform and your analysis application. For example, see variance-stabilizing transformations such as VST in the DESeq package. Log2 has a long history because it is simple and, in many cases, an improvement over using raw values for statistical analysis.
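As a rough illustration of the variance argument, here is a minimal Python sketch using only the standard library. It assumes purely multiplicative (log-normal) noise, which is a simplification and not a model of any particular platform; the `simulate` helper, the expression levels, and the noise size are all made up for the example. Under that assumption the raw standard deviation grows with the expression level, while the standard deviation on the log2 scale stays roughly constant. The last lines check that switching to log10 only rescales the values.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate(mean_level, n=2000, noise_sd=0.3):
    """Hypothetical measurements with multiplicative (log-normal) noise."""
    return [mean_level * random.lognormvariate(0.0, noise_sd) for _ in range(n)]


levels = [10, 100, 1000, 10000]  # made-up expression levels
raw_sd = {}
log2_sd = {}
for level in levels:
    values = simulate(level)
    raw_sd[level] = statistics.stdev(values)
    log2_sd[level] = statistics.stdev([math.log2(v) for v in values])
    print(f"level={level:>6}  raw SD={raw_sd[level]:10.2f}  log2 SD={log2_sd[level]:.3f}")

# The base is only a constant rescaling: log10(x) = log2(x) * log10(2)
x = 8.0
base_change_ok = math.isclose(math.log10(x), math.log2(x) * math.log10(2))

# On the log2 scale, a doubling of expression is a difference of one unit
doubling = math.log2(200) - math.log2(100)
```

Real count data also carry noise that is not multiplicative, especially at low expression levels, so a plain log does not fully remove the mean-variance dependence there. That is one reason dedicated variance-stabilizing transformations exist.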