Using linear regression to account for confounding variables in gene expression data
5.3 years ago
newbio17 ▴ 360

My data consists of 20,000 rows (genes) and 300 columns (samples). 5 out of 300 are cell lines and 295 out of 300 are tumor samples.

I'm currently attempting to adjust the expression values according to a confounding variable using linear regression. In summary, I have a dataframe consisting of the gene expression values and a vector of values of the confounding variable for each of the 300 samples.

Below is my first attempt at this:

design = model.matrix(~group+confounder)
fit = lmFit(df, design)
adjValues = fitted(fit)

The resulting design matrix looks something like:

           (Intercept) group confounder
Sample #1  1           0     1
Sample #2  1           0     1
Sample #3  1           0     1
Sample #4  1           0     1
Sample #5  1           0     1
# ======================================== Above is cell line; Below is tumor sample
Sample #6  1           1     .91
Sample #7  1           1     .75
...

I thought it would be straightforward, but doing this results in a weird problem: the adjusted values of the 5 cell lines are identical to one another for every gene. This also seems to happen when I change the design matrix to design = model.matrix(~confounder).

What is the problem with how I am currently employing linear regression to adjust gene expression accounting for the given confounder?

RNA-Seq linear regression • 1.9k views
5.3 years ago

From what I can see, the confounder isn't randomized among your groups: it is always 1 for group "0" (the 5 cell lines), while it takes different values for group "1" (the 295 tumors). So, I would be skeptical about any results returned, even if the program runs without an error message.
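To illustrate why the cell lines can end up with identical values, here is a minimal sketch in base R on simulated data (the sample sizes, the names `expr` and `fitted_vals`, and all values are made up for illustration; `lm()` stands in for `lmFit()` so the example needs no extra packages). Fitted values are just the design matrix multiplied by the per-gene coefficients, so samples that share the same design row necessarily share the same fitted value.

```r
# Simulated stand-in for the data in the question (all values hypothetical)
set.seed(1)
n_genes <- 50
group <- c(rep(0, 5), rep(1, 20))                  # 5 cell lines, 20 tumors
confounder <- c(rep(1, 5), runif(20, 0.3, 0.95))   # constant within the cell lines
expr <- matrix(rnorm(n_genes * 25), nrow = n_genes)

design <- model.matrix(~ group + confounder)

# Fit one linear model per gene; lm() accepts a matrix response (samples in rows)
fit <- lm(t(expr) ~ 0 + design)
fitted_vals <- t(fitted(fit))                      # back to genes x samples

# The five cell lines have identical design rows, so within each gene
# their fitted values agree (up to floating-point error)
cell_line_fitted <- fitted_vals[, 1:5]
identical_within_gene <- all(abs(cell_line_fitted - cell_line_fitted[, 1]) < 1e-10)
print(identical_within_gene)
```

If this is what is happening, the fitted values are the model's predictions rather than confounder-adjusted expression, which may explain the pattern described in the question.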

Your sample design is also highly unbalanced, so I am not sure if that is a contributing factor.

I'm not sure what you are trying to study between the cell lines and tumor samples. If you have treatment and control conditions in both, running separate cell line and tumor comparisons might help give more weight to the cell line part. However, the sample type (cell line or fresh tumor) would then be the confounder, and your treatment "group" would be mixed with the variable that you are trying to adjust for.

In general, I would always recommend visualizing expression among your differentially expressed genes, to see if your model had its intended effect (and/or if there were unintended sources of variation that may not be relevant to your biological question).

I'm guessing that you've already tested using functions from edgeR, limma-voom, and DESeq2 to identify differentially expressed genes.

Sorry that I can't help provide more specific advice, but are there other ways that you can think of addressing your biological question (which may or may not involve differential expression, and may or may not involve just using this high-throughput gene expression data)?


To clarify, I'm attempting to replicate the correlation analysis that was done in the paper below: https://www.biorxiv.org/content/biorxiv/early/2018/09/20/422592.full.pdf

One of the steps that they perform is adjusting the expression levels of all samples according to tumor purity using linear regression, which I am trying to replicate. Now I'm wondering if it would be more appropriate to run regression on each of the samples instead.
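For reference, one common way to adjust expression for a continuous covariate is to fit a per-gene regression and subtract only the estimated covariate term. The sketch below does this in base R on simulated data; it is not taken from the paper, and `purity`, `expr`, and `adjusted` are hypothetical names with made-up values.

```r
# Simulated data: expression that depends linearly on purity plus noise
set.seed(2)
n_genes <- 40
n_samples <- 30
purity <- runif(n_samples, 0.3, 0.95)              # hypothetical purity estimates
true_slope <- rnorm(n_genes)
expr <- outer(true_slope, purity) +
  matrix(rnorm(n_genes * n_samples, sd = 0.1), nrow = n_genes)

# One regression per gene (samples in rows of the response matrix)
fit <- lm(t(expr) ~ purity)
slope <- coef(fit)["purity", ]                     # estimated purity slope per gene

# Subtract the estimated purity contribution, centered so that
# each gene keeps its original mean expression
adjusted <- expr - outer(slope, purity - mean(purity))

# After adjustment, no gene is linearly associated with purity
max_abs_cor <- max(abs(apply(adjusted, 1, cor, y = purity)))
print(max_abs_cor)
```

This keeps the residual sample-to-sample variation instead of replacing the data with fitted values. limma's removeBatchEffect() has a covariates argument that performs a similar adjustment and can protect a group effect via its design argument.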


Taking the time to critically assess papers (and answer questions about your own work) is important, and I suspect this requires people to be able to study a limited number of topics in depth.

In other words, I am not immediately sure what to say about this specific paper. However, if I have a chance to review the paper with some thoughts, then I will update with an additional comment.
