Question: Adjusting for confounding variables in RNA-SEQ Differential Expression Analysis
2
gravatar for ucakhnd
2.4 years ago by
ucakhnd20
ucakhnd20 wrote:

My project is working on a large dataset of RPKM values for patients with and without Schizophrenia.

After some preprocessing steps including dumping genes with lots of zero RPKM values and log2-transforming, I have applied Non-negative Matrix Factorization (NNMF) as a dimensionality reduction technique. I am looking for statistically significant correlations between the resulting groups of genes ('metagenes') and schizophrenia.

Until now, I have been using a simple t-test, with Bonferroni correction, to test the metagene expression values for correlation with Schizophrenia. I think that the normality condition is fine because there are about 150 cases and 170 controls - so CLT holds. Some of the results have such very low adjusted p-values that I am relatively certain I have found something interesting.

However, I need to be sure absolutely sure that this is not down to confounding factors. There are slight imbalances by demographic in the schizophrenia vs. non-schizophrenia groups - I need to correct for a few variables, both discrete and continuous - the full list I want to correct for is: Age, Sex, Race, Smoking or not, Postmortem interval, sample pH, and RNA integrity number.

Is there a statistical test, more advanced than the t-test, that can be applied that will ACCOUNT for the impact of these confounding covariates, and make sure that I really have found statistically significant correlations with Schizophrenia? If there is not, then can you recommend how I could change my procedure to best guard against the the confounding factors?

Want to make sure I'm reporting solid results! Thanks for any help you can give.

ADD COMMENTlink modified 2.4 years ago by btsui290 • written 2.4 years ago by ucakhnd20
1
gravatar for Kevin Blighe
2.4 years ago by
Kevin Blighe69k
Republic of Ireland
Kevin Blighe69k wrote:

Unfortunately, RPKM data is not ideal for the purposes of cross-sample differential expression analysis. Please read THIS. In their key points:

The Total Count and RPKM normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Logging (base 2) RPKM data may make it's distribution look more even, but the underlying issue still persists. I suggest that you go back to obtain the raw counts and re-normalise by TMM (edgeR) or using DESeq2's method.

----------------------------------------------------------

Nevertheless, in order to 'adjust' for confounding factors / covariates, you should probably first check whether these factors are actually influential (statistically) on schizophrenia by testing each independently:

MyData
  Schizophrenia  Gene1   ...  Age  Sex   ...
1         Y       2.33   ...  45     M   ...
2         Y       3.21   ...  43     M   ...
3         N       1.21   ...  26     F   ...
4         N       2.11   ...  35     T   ...
...

Check encoding:

MyData$Schizophrenia <- factor(MyData$Schizophrenia, levels=c("N","Y"))
MyData$Age <- as.numeric(MyData$Age)
et cetera

Check each in a logistic regression model:

summary(glm(Schizophrenia ~ Age, data=MyData))
summary(glm(Schizophrenia ~ Sex, data=MyData))
summary(glm(Schizophrenia ~ Race, data=MyData))
summary(glm(Schizophrenia ~ SmokingStatus, data=MyData))
summary(glm(Schizophrenia ~ PostmortemInterval, data=MyData))
summary(glm(Schizophrenia ~ SamplepH, data=MyData))
summary(glm(Schizophrenia ~ RIN, data=MyData))

If either of these are not statistically significant, you may consider leaving them out. You then test each gene independently and include the statistically significant factors in the model formula:

For example, if Age , Sex, SmokingStatus, and PostmortemInterval model <- glm(Schizophrenia ~ Gene1 + Age + Sex + SmokingStatus + PostmortemInterval, data=MyData) summary(model)

model <- glm(Schizophrenia ~ Gene2 + Age + Sex + SmokingStatus + PostmortemInterval, data=MyData)
summary(model)

et cetera

If you need to set this up as a loop over all genes:

genelist <- c("Gene1", "Gene2", ..., "GeneX")
for (i in 1:length(genelist)) {
   formula <- as.formula(paste("Schizophrenia ~ ", genelist[i], " + Age + Sex + SmokingStatus + PostmortemInterval", sep=""))
   model <- glm(formula, data = MyData)
   print(summary(model))
}

This code could easily be done via lapply, and/or parallelised via mclapply or foreach

Kevin

ADD COMMENTlink modified 19 months ago • written 2.4 years ago by Kevin Blighe69k

Hi Kevin, Thanks for your answer. What if either of these is statistically significant happened in my sample information?

ADD REPLYlink written 19 months ago by chuanjiao60

I do not understand what you mean

ADD REPLYlink written 19 months ago by Kevin Blighe69k

I am sorry. I mean if two varies are confounded together, how to deal with this stuff?

ADD REPLYlink written 19 months ago by chuanjiao60
1
gravatar for btsui
2.4 years ago by
btsui290
UCSD
btsui290 wrote:

Q: However, I need to be sure absolutely sure that this is not down to confounding factors.

A: There are ways to minimize the possibilities of being confounded by technical covariates, but my understanding is that there is no way you can zero out the possibility of being confounded. That's why high impact journals always ask for an independent validation cohort.

Q: Is there a statistical test, more advanced than the t-test, that can be applied that will ACCOUNT for the impact of these confounding covariates, and make sure that I really have found statistically significant correlations with Schizophrenia? If there is not, then can you recommend how I could change my procedure to best guard against the confounding factors?

A: Before thinking about what kind of test, it's probably better to just write out the null hypothesis that you are testing. In general, the hardest part is deriving a null background that is not BS. In that case, you probably should show us the distribution of the data. The expression data usually have normal distribution after dropping the 0 and log2 transform. TL/DR: a simple regression model with indicator variables for the discrete variables should do the job for your case. At least the coefficients will be useful for telling you how each factor is contributing to the gene expression. Wald test will also show whether the factors are "significant" in the regression model or not.

ADD COMMENTlink written 2.4 years ago by btsui290
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2274 users visited in the last hour