How to normalize gene expression values for finding similar genes
8.4 years ago
ciclustigu ▴ 10

I am a bit confused about which data transformation to use when analyzing gene expression data.

I have a matrix $X$, where rows are genes, columns are individuals, and entry $X_{i,j}$ is the expression value of gene $i$ in individual $j$. I want to compare the pairwise similarity between genes, that is, between rows $X_{i,:}$ and $X_{k,:}$.

Values can vary a lot among individuals, and one individual could have high values for all genes. That individual would then have a larger impact on the similarities/dissimilarities between genes than the other individuals.
Therefore, I think I have to "transform" my data somehow. However, it is not clear how to proceed.

Log transformation seems to be quite common in bioinformatics. It simply replaces each entry of the matrix $X$ with its logarithm.
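As a minimal sketch of the log transformation described above: the pseudocount of 1 (to avoid taking the log of zero) and the base-2 logarithm are assumptions for illustration, not something stated in the question, and `log_transform` is a hypothetical helper name.

```python
import numpy as np

def log_transform(X, pseudocount=1.0):
    """Return log2(X + pseudocount), applied to every entry of X.

    The pseudocount is an assumption: it keeps zero expression values finite.
    """
    X = np.asarray(X, dtype=float)
    if np.any(X < 0):
        raise ValueError("expression values must be non-negative")
    return np.log2(X + pseudocount)

# Toy matrix: 2 genes (rows) x 3 individuals (columns)
X = np.array([[0.0, 3.0, 7.0],
              [15.0, 31.0, 1023.0]])
X_log = log_transform(X)
```

The transformation compresses large values, so a single very highly expressed entry contributes less to any distance computed afterwards.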

Another idea is to normalize the data matrix. But I do not understand whether I should impose unit norm on each row or on each column.

On the one hand, if I impose unit norm on each row, I am forcing each gene to have the same norm or, in biological terms, the same overall magnitude of expression across individuals.

On the other hand, if I impose unit norm on each column, I am forcing each individual to have the same norm or, in biological terms, the same overall magnitude of expression across genes. That seems more reasonable to me: if I compute the Euclidean distance between gene 1 and gene 2, their expression values in each individual should be on the same "order of magnitude", so that every individual has approximately the same "influence" on the overall distance between the two genes.
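To make the two options concrete, here is a small sketch on a made-up 3 x 3 matrix in which the third individual has much larger values than the others. The helper names (`normalize_rows`, `normalize_columns`, `gene_distance`) and the data are hypothetical, and the Euclidean (L2) norm is assumed.

```python
import numpy as np

def normalize_rows(X):
    """Scale each gene (row) to unit Euclidean norm; all-zero rows are left as-is."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def normalize_columns(X):
    """Scale each individual (column) to unit Euclidean norm; all-zero columns are left as-is."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def gene_distance(X, i, k):
    """Euclidean distance between gene i and gene k (rows of X)."""
    return float(np.linalg.norm(X[i] - X[k]))

# Toy data: 3 genes x 3 individuals; individual 3 has much larger values
X = np.array([[1.0, 2.0, 30.0],
              [2.0, 1.0, 30.0],
              [3.0, 3.0, 10.0]])

X_rows = normalize_rows(X)     # every gene has norm 1
X_cols = normalize_columns(X)  # every individual has norm 1

d_raw = gene_distance(X, 0, 2)
d_cols = gene_distance(X_cols, 0, 2)
```

Comparing `d_raw` with `d_cols` for a few gene pairs shows how much of the raw distance comes from the one high-valued individual.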

So, how do you transform gene expression values to compare gene similarities?

Should I normalize the rows (same norm for all genes) or the columns (same norm for all individuals)?

gene RNA-Seq • 6.0k views
1

Various RNA-Seq normalization methods were compared in a paper, for your reference ([Dillies et al., 2013, ref-1]). However, which method is better depends heavily on your biological question and the structure of your data. I usually use TMM to normalize my RNA-Seq data ([Robinson & Oshlack, 2010, ref-3]), but it varies case by case. Based on your description, the normalization and meta-analysis method developed by NextBio (since acquired by Illumina) might also be helpful ([Kupershmidt et al., 2010, ref-2]).

• ref-1: Dillies, M.A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, G., Castel, D., Estelle, J., Guernec, G., Jagla, B., Jouneau, L., Laloe, D., Le Gall, C., Schaeffer, B., Le Crom, S., Guedj, M., Jaffrezic, F. & The French StatOmique Consortium (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform. 14, 671-683.
• ref-2: Kupershmidt, I., Su, Q.J., Grewal, A., Sundaresh, S., Halperin, I., Flynn, J., Shekar, M., Wang, H., Park, J., Cui, W.W., Wall, G.D., Wisotzkey, R., Alag, S., Akhtari, S. & Ronaghi, M. (2010). Ontology-based meta-analysis of global collections of high-throughput public data. PLoS ONE 5.
• ref-3: Robinson, M.D. & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11.
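For orientation only, the simplest form of between-sample scaling for count data is library-size scaling (counts per million). This is not TMM: TMM estimates more robust scaling factors and is available in the edgeR package for R. The sketch below assumes a raw count matrix with genes as rows and samples as columns; `counts_per_million` is a hypothetical helper name and the counts are made up.

```python
import numpy as np

def counts_per_million(counts):
    """Divide each sample (column) by its total count and multiply by 1e6.

    This is plain library-size scaling, NOT the TMM method.
    """
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0, keepdims=True)
    if np.any(library_sizes == 0):
        raise ValueError("every sample needs at least one read")
    return counts / library_sizes * 1e6

# Toy counts: 3 genes x 2 samples; sample 2 was sequenced 10x deeper
counts = np.array([[10, 100],
                   [30, 300],
                   [60, 600]])
cpm = counts_per_million(counts)
```

After scaling, the two toy samples are identical, because they differed only in sequencing depth.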
0

What do you want to do downstream with the data? There are a number of common normalization schemes currently in use, and you'd be well advised to use one of them rather than rolling your own method unless you have a really good reason.

0

I want to compute a "similarity" between genes, which could be a distance or a co-expression measure.

Could you name the normalization schemes explicitly?

1

You'll need to think about how your distance metric is affected by your "normalization". Here is a little discussion of distance metrics and a graphical example of how the concept of distance is important.

https://youtu.be/tln64P-w_8c

In practice, using correlation between genes is often a good choice since we do not care about the absolute differences between gene expression measures, only relative changes. Normalizing by row for visualization purposes is also a common approach.
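A small sketch of why correlation ignores absolute differences: Pearson correlation between two genes does not change when one gene is rescaled, and it is tied to the Euclidean distance between row-wise z-scored profiles by $d^2 = 2n(1 - r)$, where $n$ is the number of individuals and the z-scores use the population standard deviation. The helper names and toy data are hypothetical.

```python
import numpy as np

def gene_correlation(X):
    """Pearson correlation between all pairs of genes (rows of X)."""
    return np.corrcoef(np.asarray(X, dtype=float))

def zscore_rows(X):
    """Center each gene to mean 0 and scale to population std 1; constant rows become 0."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / np.where(sd == 0, 1.0, sd)

# Toy data: gene 2 is gene 1 scaled by 10; gene 3 is gene 1 reversed
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0],
              [4.0, 3.0, 2.0, 1.0]])

R = gene_correlation(X)
Z = zscore_rows(X)
n = X.shape[1]

# Squared Euclidean distance between z-scored genes 1 and 3
d2 = float(((Z[0] - Z[2]) ** 2).sum())
```

Here genes 1 and 2 are perfectly correlated despite a tenfold difference in magnitude, and `d2` matches `2 * n * (1 - R[0, 2])`.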

0

You might take an approach like the one recommended by WGCNA, which you might also want to use directly. In short, you take normalized values from DESeq2 or similar and compute covariation from those. I'm guessing that you want to build a network or cluster things into modules, hence the WGCNA mention.
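As a rough illustration of the idea, and not the WGCNA package itself (which is an R package), an unsigned soft-thresholded adjacency can be sketched as $a_{ik} = |\mathrm{cor}(x_i, x_k)|^{\beta}$. The sketch assumes the input has already been normalized (for example with DESeq2); `beta = 6` is only an illustrative value, since the power is normally chosen from the data; and the function names are hypothetical.

```python
import numpy as np

def soft_threshold_adjacency(X_norm, beta=6):
    """Unsigned adjacency |cor|^beta between genes (rows of X_norm).

    Assumes X_norm is already normalized and has no constant rows
    (np.corrcoef returns NaN for a gene with zero variance).
    """
    R = np.corrcoef(np.asarray(X_norm, dtype=float))
    A = np.abs(R) ** beta
    np.fill_diagonal(A, 0.0)  # ignore self-connections
    return A

def connectivity(A):
    """Sum of adjacencies per gene, a simple measure of how connected it is."""
    return A.sum(axis=1)

# Toy normalized data: 3 genes x 5 individuals
X_norm = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                   [2.0, 4.0, 6.0, 8.0, 11.0],
                   [5.0, 3.0, 4.0, 1.0, 2.0]])

A = soft_threshold_adjacency(X_norm)
k = connectivity(A)
```

Raising the correlation to a power pushes weak correlations toward zero while keeping strong ones, so the resulting matrix can be fed to a clustering step to look for modules.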