Tools for screening the influence/importance of covariates in multidimensional data
4 months ago
Papyrus ★ 1.3k

Hi all,

I'm looking for tools to assess how much covariates (either continuous or categorical) explain the variation in a dataset (e.g. gene expression data), so as to screen which variables one may want to adjust for when testing in a linear model framework (in limma, DESeq2, etc.).

For example, I have often used the pcrplot function of the ENmix R package, which correlates variables to principal components and produces a useful summary plot.

(And of course there is always visual screening of PCA plots colored by each variable.)
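For readers who want to see what this kind of PC-to-covariate screening amounts to, here is a minimal NumPy sketch. It is not the ENmix implementation: the function name `pc_covariate_r2`, the toy data, and the choice of R² as the summary statistic are all assumptions made for illustration. It computes PC scores by SVD and reports, for each covariate, the R² of a linear fit of each PC on that covariate, treating non-numeric covariates as categorical.

```python
import numpy as np


def pc_covariate_r2(expr, covariates, n_pcs=3):
    """Hypothetical helper: R^2 of each of the top PCs regressed on each covariate.

    expr       : samples x features array (e.g. samples x genes).
    covariates : dict mapping a name to a 1-D array with one value per sample;
                 numeric arrays are treated as continuous, anything else as categorical.
    """
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=0)          # center each feature
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_pcs] * s[:n_pcs]            # PC scores, samples x n_pcs

    result = {}
    for name, values in covariates.items():
        values = np.asarray(values)
        if np.issubdtype(values.dtype, np.number):
            # continuous covariate: intercept + slope
            design = np.column_stack([np.ones(len(values)), values.astype(float)])
        else:
            # categorical covariate: one indicator column per level
            levels = np.unique(values)
            design = (values[:, None] == levels[None, :]).astype(float)
        coef, *_ = np.linalg.lstsq(design, scores, rcond=None)
        resid = scores - design @ coef
        ss_res = (resid ** 2).sum(axis=0)
        ss_tot = ((scores - scores.mean(axis=0)) ** 2).sum(axis=0)
        result[name] = 1.0 - ss_res / ss_tot
    return result


# Toy data (made up): 12 samples, 50 genes, a strong batch shift, and an unrelated age.
rng = np.random.default_rng(0)
n_samples, n_genes = 12, 50
batch = np.array(["a"] * 6 + ["b"] * 6)
age = rng.uniform(20, 70, n_samples)
expr = rng.normal(size=(n_samples, n_genes))
expr[batch == "b"] += 3.0                        # batch "b" is shifted in every gene

r2 = pc_covariate_r2(expr, {"batch": batch, "age": age})
for name, vals in r2.items():
    print(name, np.round(vals, 2))
```

In this toy example the batch covariate should explain almost all of PC1. With real data, a table like this is only a screening aid: it shows which covariates track the main axes of variation, not whether they belong in the model.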

But I'm wondering if anyone knows of more sophisticated methods, or methods from which one can extract more "objective" stats to justify subsequent inclusion/exclusion of variables in the models. For example I've seen the R package pvca but it only works with categorical covariates.

Alternatively, what is your usual process when you want to do differential testing with linear models and have many associated phenotypic variables?

thanks!

R PCA confounding batch regression • 548 views

Search for feature selection in machine learning. Some approaches, such as lasso or tree-based methods (e.g. random forest, xgboost), output a variable importance. Another popular approach is recursive feature elimination.


Thanks! I thought feature selection methods were generally applied to choose or collapse the features that make up the data itself (e.g. genes), so my issue may be a bit different. I have two data frames/matrices: "A", with the data/features (the gene measurements), and "B", with other, varied covariates (e.g. age, sex, batch...). The goal is to perform differential testing (not to build predictive models) on "A", which contains the features "of interest". I could combine "A" and "B" and run feature selection across everything, but that would only make sense if my goal were a "predictive" model using some genes plus the covariates that best separate some groups; I just want to test the genes between conditions.

(although I have little experience in ML and may have misunderstood your suggestions)


I am not sure what you're trying to achieve. Since you mention linear models, you could also compute different models and select one based on an information criterion. This previous post may also be of interest.
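As a rough illustration of the information-criterion idea, the sketch below fits several candidate linear models to a single response vector and compares them by AIC. Everything specific here is an assumption for illustration: the helper name `gaussian_aic`, the toy data, the choice to always keep the condition of interest in the model, and the use of one response (in practice this might be one gene, or a summary such as a PC score, repeated as needed).

```python
import itertools

import numpy as np


def gaussian_aic(y, design):
    """AIC of an ordinary least squares fit, up to an additive constant.

    Uses n * log(RSS / n) + 2 * p, where p counts the regression
    coefficients plus one for the residual variance.
    """
    n, k = design.shape
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(((y - design @ coef) ** 2).sum())
    return n * np.log(rss / n) + 2 * (k + 1)


# Toy data (made up): the response depends on condition and batch, not on age.
rng = np.random.default_rng(1)
n = 40
condition = np.tile([0.0, 1.0], n // 2)           # variable of interest, always kept
optional = {
    "batch": np.repeat([0.0, 1.0], n // 2),
    "age": rng.uniform(20, 70, n),
}
y = 1.0 * condition + 2.0 * optional["batch"] + rng.normal(scale=0.5, size=n)

# Fit every subset of the optional covariates on top of intercept + condition.
aic = {}
for r in range(len(optional) + 1):
    for subset in itertools.combinations(optional, r):
        cols = [np.ones(n), condition] + [optional[name] for name in subset]
        aic[subset] = gaussian_aic(y, np.column_stack(cols))

best = min(aic, key=aic.get)
for subset, value in sorted(aic.items(), key=lambda item: item[1]):
    print(subset, round(value, 1))
print("lowest AIC:", best)
```

Swapping the penalty `2 * (k + 1)` for `np.log(n) * (k + 1)` gives BIC, which penalizes extra covariates more heavily.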


OK, I'll start from there, thanks!

10 weeks ago
Papyrus ★ 1.3k

I'm updating this because I came upon a nice R/Bioconductor package specifically dedicated to this issue: variancePartition. It is designed to facilitate exploring how the covariates in an experiment explain variation in the data.

10 weeks ago
Martombo ★ 2.9k

I especially like the SVA package for this kind of analysis. It estimates latent covariates (surrogate variables) from the gene expression matrix while preserving the variation of the comparison you are focused on. Use the svaseq function for RNA-seq data, which returns the estimated surrogate variables. You can then choose a subset or use them all to correct your linear model or to remove their associated variation.


Yes, I agree that the SVA approach is a great tool for identifying (and correcting for) latent sources of variation. Moreover, the identified SVs could probably also be fed into variancePartition to explore how much variation they explain compared with the known covariates, and that would surely be of interest!

Nonetheless, this post was aimed more at the general, "unsupervised" exploration of the data. I've used SVA with great results, but one could argue that "protecting" the comparison/phenotype of interest is sometimes a bit "supervised", in the sense that you're intentionally avoiding variables with some correlation to the phenotype of interest. Sometimes one may want to include known, measured covariates if their effect is clear, even at the cost of losing some biological signal because they are somewhat associated with the phenotype of interest.