Question

Perplexed regarding the process of gene expression normalisation

16 months ago

DS ▴ 10

Hello everyone,

I have a gene expression matrix consisting of 10 genes (rows) and 300 samples (columns). My objective is to prepare this dataset for eQTL analysis.

It is necessary to convert the data into a standard normal distribution when conducting tests on my linear mixed model.

Currently, I am working with TPM as my starting point. Here are the steps I plan to take:

1. Apply a log transformation to TPM+1; I intend to perform this operation on a per-gene basis.
2. Perform either Quantile Normalization or Median Normalization; I believe it would be best to apply this step on a per-sample basis.
3. Standardize each gene by scaling it so that the mean equals 0 and the standard deviation equals 1.
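
For concreteness, here is a minimal NumPy sketch of the log transformation and the per-gene standardization described above. The matrix is simulated and the function names are made up for illustration. Two caveats: log2(TPM+1) is element-wise, so applying it per gene or per sample gives the same result, and z-scaling fixes only the mean and variance of each gene, not the shape of its distribution.

```python
import numpy as np

def log_transform(tpm):
    """log2(TPM + 1), applied element-wise to the whole matrix."""
    return np.log2(np.asarray(tpm, dtype=float) + 1.0)

def zscore_genes(expr):
    """Scale each gene (row) to mean 0 and standard deviation 1."""
    mean = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, ddof=1, keepdims=True)
    return (expr - mean) / sd

# Simulated TPM matrix (assumption): 10 genes (rows) x 300 samples (columns).
rng = np.random.default_rng(0)
tpm = rng.gamma(shape=2.0, scale=50.0, size=(10, 300))

scaled = zscore_genes(log_transform(tpm))
```

After this, every row of `scaled` has mean 0 and sample standard deviation 1 across the 300 samples.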

Can someone confirm if my thought process is correct? Any other comments are welcome.

gene RNA-seq expression normalization • 1.1k views


score 1 · Answer 1 · 2024-02-23

I don't know what the norm is in the eQTL field. If you are going to normalise on a per-sample basis, you almost certainly need to use more than 10 genes. In particular, you definitely can't do Quantile Normalisation with only 10 data points in each sample.

Because in eQTL analysis you are comparing the same gene across many samples, I would argue against using TPM. Instead, if the data are available, I would start with raw counts and then apply DESeq2's rlog transformation, which takes care of both normalisation and transformation to something approaching a normal distribution. rlog can be slow on large datasets (you'll still need all the genes to do the normalisation). I've not experimented with the VST transform, but it also makes gene counts homoskedastic, on a log scale, and runs faster than rlog.
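
To illustrate why all genes are needed, here is a rough NumPy sketch of median-of-ratios size factors, the library-size normalisation that DESeq2 estimates by default before its transformations. This is not rlog or VST, only the size-factor step, and the counts and function name are simulated or hypothetical.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Per-sample size factors from a genes x samples raw count matrix.

    Genes with a zero count in any sample are skipped, since their
    geometric mean across samples is zero.
    """
    counts = np.asarray(counts, dtype=float)
    keep = (counts > 0).all(axis=1)
    log_counts = np.log(counts[keep])
    # Log geometric mean of each gene across samples (a pseudo-reference sample).
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor = median over genes of the ratio to the pseudo-reference.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Simulated counts (assumption): 2000 genes, 4 samples with different depths.
rng = np.random.default_rng(0)
depth = np.array([1.0, 2.0, 1.0, 0.5])
base = rng.gamma(shape=2.0, scale=200.0, size=2000)
counts = rng.poisson(base[:, None] * depth)

size_factors = median_of_ratios_size_factors(counts)
```

The median is taken over genes within each sample, so with only 10 genes it would rest on a handful of ratios and be easily skewed by one or two of them.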

score 0 · Answer 2 · 2024-02-23

Is "10" genes a typo? Maybe "10K"? Quantile and Median normalization are probably not going to do what you expect if you only have 10 genes.

For (2), these normalization steps are only defined to operate on a per-sample basis. Quantile normalization (in point of fact CQN) is commonly applied for eQTLs (and commonly skipped, as it can inadvertently deflate values for genes that have a low-frequency eQTL).
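
One way to see the concern with 10 genes is to sketch plain quantile normalization in NumPy. This is not CQN, it ignores ties, and the data are simulated. Each sample's values are replaced by the mean of the values holding the same rank across all samples.

```python
import numpy as np

def quantile_normalize(expr):
    """Quantile-normalize the columns (samples) of a genes x samples matrix.

    Simplification: ties are broken by order of appearance instead of
    being averaged.
    """
    expr = np.asarray(expr, dtype=float)
    order = np.argsort(expr, axis=0, kind="stable")
    # Inverting the sort order gives the rank of each gene within its sample.
    ranks = np.argsort(order, axis=0, kind="stable")
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = np.sort(expr, axis=0).mean(axis=1)
    return reference[ranks]

# Simulated log-expression (assumption): 10 genes x 300 samples.
rng = np.random.default_rng(0)
expr = rng.normal(size=(10, 300))
qn = quantile_normalize(expr)
```

With 10 genes, every sample ends up holding the same 10 values, so a gene's normalized expression is decided entirely by its rank among the other 9 genes in that sample.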

Sample outlier exclusion via PCA is also commonly applied - I don't see that step here.
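
A minimal sketch of PCA-based outlier flagging, assuming a log-transformed genes x samples matrix. The number of PCs and the 3-SD cutoff are arbitrary choices for illustration, not a field standard, and the data are simulated.

```python
import numpy as np

def pca_outliers(expr, n_pcs=2, n_sd=3.0):
    """Return indices of samples lying more than n_sd SDs from the mean
    on any of the top n_pcs principal components."""
    centered = expr - expr.mean(axis=1, keepdims=True)  # center each gene
    # The right singular vectors hold the sample coordinates.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = (s[:n_pcs, None] * vt[:n_pcs]).T           # samples x n_pcs
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
    return np.where((np.abs(z) > n_sd).any(axis=1))[0]

# Simulated data (assumption): 200 genes x 100 samples, with sample 7 shifted.
rng = np.random.default_rng(0)
expr = rng.normal(size=(200, 100))
expr[:, 7] += 10.0

outliers = pca_outliers(expr)
```

Here the shifted sample dominates the first PC and gets flagged. In practice you would also look at the PC plots before excluding anything.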

Also, some kind of correction for known covariates (sex, age, RIN, sequencing depth) and unknown sources of variation (PEER, RUV, SVA, etc.) is applied for eQTLs; I don't see that here either. You may be intending to include these in your mixed-effect model, but do compute and include something like PEER factors.
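
If you correct the expression matrix up front rather than inside the model, a rough sketch of regressing out known covariates looks like this. The covariates are simulated and the function name is hypothetical. Hidden factors from PEER, RUV, or SVA are not computed here, but once estimated they could be appended as extra covariate columns.

```python
import numpy as np

def residualize(expr, covariates):
    """Remove linear covariate effects from each gene.

    expr: genes x samples; covariates: samples x k (intercept added here).
    """
    n_samples = expr.shape[1]
    design = np.column_stack([np.ones(n_samples), covariates])
    # Fit all genes at once: one column of coefficients per gene.
    beta, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    return (expr.T - design @ beta).T

# Simulated covariates and expression (assumption).
rng = np.random.default_rng(0)
n = 300
sex = rng.integers(0, 2, size=n).astype(float)
age = rng.uniform(20, 80, size=n)
covariates = np.column_stack([sex, age])
expr = rng.normal(size=(10, n)) + 1.5 * sex + 0.02 * age

resid = residualize(expr, covariates)
```

The residuals are uncorrelated with sex and age by construction. Their variance is no longer 1, which is why a second z-scaling is often applied afterwards.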

Z-scaling is typical and usually occurs twice (once before covariate correction, and again afterwards -- the second rescaling is not strictly necessary since it only impacts eQTL effect sizes but not p-values).
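
A small check of the last point, using a simple linear regression on simulated genotype dosages rather than a mixed model: rescaling the expression values changes the slope but leaves the t-statistic, and therefore the p-value, unchanged.

```python
import numpy as np

def slope_and_t(x, y):
    """Simple linear regression of y on x: return (slope, t-statistic)."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    sxx = (x_c ** 2).sum()
    slope = (x_c * y_c).sum() / sxx
    resid = y_c - slope * x_c
    se = np.sqrt((resid ** 2).sum() / (len(x) - 2) / sxx)
    return slope, slope / se

# Simulated data (assumption): dosages 0/1/2 and one gene's expression.
rng = np.random.default_rng(1)
genotype = rng.integers(0, 3, size=300).astype(float)
expression = 0.4 * genotype + rng.normal(size=300)

slope_raw, t_raw = slope_and_t(genotype, expression)
sd = expression.std(ddof=1)
slope_z, t_z = slope_and_t(genotype, (expression - expression.mean()) / sd)
```

`slope_z` equals `slope_raw / sd`, while `t_z` matches `t_raw`.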