Question: Best Way To Quantify Influence Of Different Covariates On Gene Expression?
7
gravatar for Mikael Huss
5.5 years ago by
Mikael Huss4.7k
Stockholm
Mikael Huss4.7k wrote:

I am looking for input on ways to put numbers on how much different factors/covariates contribute to gene expression patterns. For example, let's say we have RNA-seq (or microarray) data from male and female mice that have received two different treatments (or no treatment) and the samples have been prepped in two different batches. Now we would like a bar plot (or similar) of how much the different covariates (1) sex, (2) treatment and (3) batch effects contribute, respectively, to the gene expression.

Some ideas:

  1. Use PCA on the expression matrix and calculate the correlation between PC1 scores and the covariate vectors.
  2. Use some machine learning algorithm to try to discriminate between male/female (or treatments, or batches), and evaluate by cross-validation how well this is working. Use the prediction performance as a measure of how strongly the covariate is reflected in the expression profile.
  3. Is there a way to find out this sort of summarized importance from the linear models fitted as part of limma/edgeR/DESeq workflows?

I'd be grateful for any thoughts.

microarray rna-seq • 7.7k views
ADD COMMENTlink modified 3.3 years ago by chris86290 • written 5.5 years ago by Mikael Huss4.7k

Do you want this for each gene or some sort of summarized metric for the whole experiment (I'm guessing the latter from your mention of just using PCA)?

ADD REPLYlink written 5.5 years ago by Devon Ryan92k

Yes, a summarized metric for the whole experiment.

ADD REPLYlink written 5.5 years ago by Mikael Huss4.7k
11
gravatar for Irsan
5.5 years ago by
Irsan7.0k
Amsterdam
Irsan7.0k wrote:

As a first step, it might be useful to remove genes with count 0 in more than 90% of your samples. Then obtain normalized and log-transformed expression estimates (logFPKM; maybe edgeR and voom() can be of use?) for all your samples and genes. I assume you already have obtained them somehow. Then prepare your data in such a way that you have a data frame where a row represents 1 measurement and the columns geneId, sex, treatment, batch, logFpkm (maybe aggregate() or melt() functions can be useful to reshape your data). Then perform anova and look at the sums of squares and the p-values of the F-test. Most likely, geneId has the biggest influence on your logFpkms and hopefully treatment is a good second.

fit <- lm(logFpkm ~  geneId + sex + treatment + batch,data=yourData)
anova(fit)

When you have many samples and many genes you can also take a random sample of your genes (say 1000) to get an idea and then scale up to bigger sample sizes.

Edit: Brentp, you are right. Its maybe better to exclude the geneId from the linear model

ADD COMMENTlink modified 3.3 years ago • written 5.5 years ago by Irsan7.0k

you mean to have a single table with n_samples * n_genes and fit one model to that?

ADD REPLYlink written 5.5 years ago by brentp23k
1

so take an expression matrix, melt it, add covariate info and fit model / do anova. Maybe there are better alternatives but it comes into my mind. I just tried it on one of my datasets and it said that geneId is by far the biggest source of variance, then tissue, then patientGroup, then batch which makes sense for my dataset.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Irsan7.0k

Thanks for the tip. I had been thinking along these lines but hadn't hatched the idea about using geneId as a feature. I'll definitely try it.

ADD REPLYlink written 5.5 years ago by Mikael Huss4.7k

you can also leave it out of the model of course. Its not surprising that it is the biggest source of variation.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Irsan7.0k

Would you please explain more about how to prepare data?

ADD REPLYlink written 3.1 years ago by abc30
2
gravatar for brentp
5.5 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

This is a good question, in ecology, for example, one might have a single set of data and they will scrutinize a model for years and find the "best" one. In genomics, we choose one model and apply it to N (million) probes / measurements / genes / CpG's / SNPs etc.

You could look at the distribution of p-values, either of your covariate of interest or of the potential confounder with and without the confounder. You can also look at the distribution of covariate estimates. If you see genomic inflation without correction for some important covariate, then that might be a good thing to check.

You could also do something like SVA or PEER on a simple model and see how the inferred surrogate variables correlate with the covariates that you did not include in the model (this is similar to your PCA idea).

Of course, it could be that different covariates only have an effect at certain sites in the genome--we know this will be true in the case of, gender and the sex chromosomes. For other effects, it will be more subtle. In the end, you will have to rely on biological insight to a large extent.

ADD COMMENTlink written 5.5 years ago by brentp23k

The SVA idea is interesting - I have been playing around with this package recently but hadn't thought to apply it to this question.

ADD REPLYlink written 5.5 years ago by Mikael Huss4.7k
1
gravatar for Charles Warden
5.5 years ago by
Charles Warden7.3k
Duarte, CA
Charles Warden7.3k wrote:

I would check the signal distributions: if the male and female samples look very different (for example, I have seen this be the case for germ-line samples), I would just analyze them separately.

The most direct answer to your question is that you can use a 2-way or 3-way ANOVA to factor out the influence of covariates. It's not super rigorous, but I think you can use something like the average sum of squares to give an idea about the relative contribution. For example, I know Partek makes a sources of variation box-plot like you described, and I think that is how they do it.

DESeq, edgeR, etc. also have workflows for multivariate analysis (I believe using a generalized linear model in both of those cases).

ADD COMMENTlink written 5.5 years ago by Charles Warden7.3k
1

Thanks for your comments. I have used DESeq, edgeR and limma to look at these data, but I am not sure what is the best way to use information about the fitted coefficients to something in general about the covariates. I should say that in my case the covariates are not in fact sex, treatment and batch but three other things; those were just an example.

ADD REPLYlink written 5.5 years ago by Mikael Huss4.7k
1
gravatar for chris86
3.3 years ago by
chris86290
United Kingdom, London
chris86290 wrote:

Edit: This is much better. https://github.com/dswatson/bioplotr/blob/master/R/plot_drivers.R

ADD COMMENTlink modified 5 weeks ago • written 3.3 years ago by chris86290

Hi @chris86, I had a look at this package, and I believe it allows to assess the effect of factors / categorical variables (e.g. sex) on RNA-seq data, but it does not allow to assess the effect of numerical covariates (e.g. weight) on the data.

ADD REPLYlink written 6 weeks ago by rodd50

This is what you want: https://github.com/dswatson/bioplotr/blob/master/R/plot_drivers.R.

I suggest everyone check out this function for assessing the drivers of variation, it requires installation of David Watsons bioplotr package.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by chris86290
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1970 users visited in the last hour