Question

Using linear regression to account for confounding variables in gene expression data

5.3 years ago

newbio17 ▴ 360

My data consists of 20,000 rows (genes) and 300 columns (samples). 5 out of 300 are cell lines and 295 out of 300 are tumor samples.

I'm currently attempting to adjust the expression values according to a confounding variable using linear regression. In summary, I have a dataframe consisting of the gene expression values and a vector of values of the confounding variable for each of the 300 samples.

Below is my first attempt at this:

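library(limma)
# group: 0 = cell line, 1 = tumor; confounder: one value per sample
design = model.matrix(~group+confounder)
fit = lmFit(df, design)
adjValues = fitted(fit)  # intended: confounder-adjusted expression values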

The resulting design matrix looks something like:

           (Intercept) group confounder
Sample #1  1           0     1
Sample #2  1           0     1
Sample #3  1           0     1
Sample #4  1           0     1
Sample #5  1           0     1
# ======================================== Above is cell line; Below is tumor sample
Sample #6  1           1     .91
Sample #7  1           1     .75
...

I thought it would be straightforward, but this produces a strange result: the expression values of the 5 cell lines come out identical to one another for every gene. The same thing happens when I change the design matrix to design = model.matrix(~confounder).

What is the problem with how I am currently employing linear regression to adjust gene expression accounting for the given confounder?

RNA-Seq linear regression • 1.9k views

updated 5.3 years ago by Charles Warden 8.2k • written 5.3 years ago by newbio17 ▴ 360

score 1 · Answer 1 · 2019-01-14

From what I can see, the confounder isn't randomized among your groups: it is always 1 for the 5 cell lines (group "0"), while it takes different values across the 295 tumors (group "1"). So, I would be skeptical about any results returned, even if the program runs without an error message or a crash.
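One possible explanation for the identical cell-line values, offered as a sketch rather than a diagnosis: `fitted()` returns the model's predictions, and a prediction depends only on a sample's row of the design matrix. The 5 cell lines all share the row (1, 0, 1), so they would all get the same fitted value for a given gene. The example below uses simulated data and base R `lm.fit` in place of `lmFit`; the names `expr` and `n_genes`, the gene count, and the range of tumor confounder values are assumptions for illustration only. It also shows one alternative, which is to subtract only the estimated confounder term from the observed values.

```r
set.seed(1)
n_cell <- 5; n_tumor <- 295
n_genes <- 20  # small stand-in for the 20,000 genes (assumption)

group <- c(rep(0, n_cell), rep(1, n_tumor))
confounder <- c(rep(1, n_cell), runif(n_tumor, 0.5, 1))  # made-up tumor values
design <- model.matrix(~ group + confounder)

# Simulated expression matrix: genes in rows, samples in columns
expr <- matrix(rnorm(n_genes * (n_cell + n_tumor)), nrow = n_genes)

# Least-squares fit for every gene at once (samples must be in rows here)
fit <- lm.fit(design, t(expr))
fitted_vals <- t(fit$fitted.values)  # back to genes x samples

# Identical design rows give identical fitted values, so the per-gene range
# across the 5 cell lines is zero up to floating-point error
cell_fitted_range <- apply(fitted_vals[, 1:n_cell], 1, function(x) diff(range(x)))
print(max(cell_fitted_range))

# Alternative: keep the observed values and remove only the confounder term
beta_conf <- fit$coefficients["confounder", ]  # one slope per gene
conf_centered <- confounder - mean(confounder)
adjusted <- expr - outer(beta_conf, conf_centered)

# The cell lines now keep their sample-to-sample differences
cell_adjusted_range <- apply(adjusted[, 1:n_cell], 1, function(x) diff(range(x)))
print(min(cell_adjusted_range))
```

Note that, because the confounder is constant within the cell lines, the slope here is estimated from the tumors alone, which is part of why I would be cautious. limma's `removeBatchEffect()` takes `covariates` and `design` arguments for this kind of adjustment, if you prefer to stay within limma.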

Your sample design is also highly unbalanced (5 cell lines versus 295 tumors), which may be a contributing factor, although I am not sure.

I'm not sure what you are trying to study (between the cell lines and tumor samples). If you have treatment and control conditions in both, I would say perhaps separate cell line and tumor comparisons would help provide more weight to the cell line part (although then the "type", cell line or fresh tumor sample, would be the confounder, and your treatment "group" would be mixed with the variable that you are trying to adjust for).

In general, I would always recommend visualizing expression among your differentially expressed genes, to see if your model had its intended effect (and/or if there were unintended sources of variation that may not be relevant to your biological question).
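As a rough sketch of that kind of visual check, the base R example below plots one simulated gene against the confounder before and after removing the fitted confounder term. Everything in it is hypothetical: the gene, the effect size, and the variable names `gene` and `gene_adj` are made up for illustration and are not taken from your data.

```r
set.seed(2)
group <- factor(c(rep("cell_line", 5), rep("tumor", 295)))
confounder <- c(rep(1, 5), runif(295, 0.5, 1))  # made-up tumor values

# One hypothetical gene whose expression depends on the confounder
gene <- 2 * confounder + rnorm(300, sd = 0.3)

# Fit group + confounder, then subtract only the confounder term
fit <- lm(gene ~ group + confounder)
slope <- unname(coef(fit)["confounder"])
gene_adj <- gene - slope * (confounder - mean(confounder))

# Side-by-side scatter plots, colored by group
op <- par(mfrow = c(1, 2))
plot(confounder, gene, col = as.integer(group),
     main = "Before adjustment", ylab = "expression")
plot(confounder, gene_adj, col = as.integer(group),
     main = "After adjustment", ylab = "expression")
par(op)

# Numeric companion to the plots: correlation with the confounder in tumors
tumor <- group == "tumor"
cor_before <- cor(confounder[tumor], gene[tumor])
cor_after <- cor(confounder[tumor], gene_adj[tumor])
print(c(before = cor_before, after = cor_after))
```

If the trend against the confounder is still visible after adjustment, or the cell lines sit far from the tumors in a way the model did not anticipate, that would be a sign the adjustment is not doing what you intended.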

I'm guessing that you've already tested using functions from edgeR, limma-voom, and DESeq2 to identify differentially expressed genes.

Sorry that I can't help provide more specific advice, but are there other ways that you can think of addressing your biological question (which may or may not involve differential expression, and may or may not involve just using this high-throughput gene expression data)?