Question

Covariate Selection For Microarray Data

10.7 years ago

aditi.qamra ▴ 270

Hi,

Can anybody suggest good R packages for variable selection when building linear models for microarray data? The literature suggests Bayesian variable selection, random forests, etc., but I would still like an opinion from experienced folks.

Essentially, I wish to find differentially expressed genes (DEGs) between two groups of diseased samples. I also have expression values from paired normal samples. Even after thorough preprocessing, the data still seem very noisy, and I am certain this is because of associated covariates such as age, disease stage, disease class, presence of infection, etc.

My problem is thus one of fitting a multiple regression model for each gene within each group of diseased samples, rather than just a 2×2 factorial design (disease group × diseased/normal).

I want to select the variables that have the most impact on gene expression in the diseased samples and then include them as covariates in my final model, rather than using backward/forward elimination.

Thanks!

r microarray • 3.1k views

updated 2.4 years ago by Ram 43k • written 10.7 years ago by aditi.qamra ▴ 270

Ram · Accepted Answer · 2013-09-06

10.7 years ago

Sean Davis 26k

The problem, of course, is that you are looking for effects for each gene independently, and a model that fits one gene well will not necessarily fit another. In practice, a thorough unsupervised analysis, plus supervised analyses on subsets of the samples, may give you a sense of which covariates matter most (in terms of their effects on gene expression). A more structured approach is to use something like SVA (surrogate variable analysis, available as the sva package in Bioconductor) to estimate the latent variables apparent in the data.
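To make the idea of latent variables concrete, here is a deliberately simplified sketch in Python/NumPy. It is not the SVA algorithm (SVA uses a more elaborate, iterative procedure); it only regresses out the known design and takes the leading singular vectors of the residuals as candidate hidden covariates. The function name `residual_factors` and all of the simulated data are hypothetical and exist only for illustration.

```python
import numpy as np

def residual_factors(expr, design, n_factors=1):
    """Simplified stand-in for latent-variable estimation (NOT SVA itself).

    expr:   genes x samples expression matrix
    design: samples x p matrix of known covariates
    Returns a samples x n_factors matrix of candidate hidden covariates.
    """
    # Fit the known design to every gene at once by least squares.
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    resid = expr.T - design @ coef          # samples x genes
    # Leading left singular vectors capture the dominant residual structure.
    u, _, _ = np.linalg.svd(resid, full_matrices=False)
    return u[:, :n_factors]

# Simulated example (all values are made up).
rng = np.random.default_rng(0)
n_genes, n_samples = 200, 12
group = np.repeat([0.0, 1.0], 6)            # known factor of interest
hidden = np.tile([-1.0, 1.0], 6)            # unrecorded factor, balanced across groups
design = np.column_stack([np.ones(n_samples), group])

loadings = rng.normal(size=n_genes)         # how strongly each gene responds to the hidden factor
expr = np.outer(loadings, hidden) + rng.normal(0.0, 0.1, size=(n_genes, n_samples))
expr[:20] += 1.5 * group                    # true group effect in the first 20 genes

factors = residual_factors(expr, design, n_factors=1)
r = abs(np.corrcoef(factors[:, 0], hidden)[0, 1])
print(f"|correlation| between estimated factor and hidden factor: {r:.3f}")
```

In a real analysis the estimated factors would be added as extra columns of the design matrix; for actual data, the sva package itself is the appropriate tool rather than this toy version.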


Thank you, that was a very helpful lead. Am I right in understanding that SVA would then help find the DEGs between group 1 and group 2 irrespective of all the other biological factors, such as disease stage, disease type, patient age, etc.?

What I finally want is a list of DEGs for the following contrast: (Group1.Diseased - Group1.Normals) - (Group2.Diseased - Group2.Normals), after making sure that none of the factors such as disease stage, disease type and so on are masking the true difference. Right now I get barely a handful of genes out of 25K+ if I use only the above contrast matrix.

10.7 years ago by aditi.qamra ▴ 270

If your sample size is large enough, you can just include your covariates in a linear model (or GLM). limma (or DESeq2 or edgeR for RNA-seq) allows such models, and you can then use your contrast of interest to find DEGs while "controlling" for the covariates.
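As a rough illustration of what such a model does for each gene, here is a minimal Python/NumPy sketch of ordinary least squares with a covariate and the interaction contrast (Group1.Diseased - Group1.Normals) - (Group2.Diseased - Group2.Normals). It is not limma: there is no empirical Bayes moderation, and the pairing between diseased and normal samples is ignored. The function name `contrast_t`, the single "age" covariate, and the simulated data are all assumptions made for the example.

```python
import numpy as np

def contrast_t(expr, design, contrast):
    """Per-gene OLS fit and t-statistic for one contrast.

    expr:     genes x samples expression matrix
    design:   samples x p design matrix (group columns plus covariates)
    contrast: length-p vector of contrast weights
    """
    n, p = design.shape
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)   # p x genes
    resid = expr.T - design @ coef
    sigma2 = (resid ** 2).sum(axis=0) / (n - p)               # residual variance per gene
    xtx_inv = np.linalg.inv(design.T @ design)
    est = contrast @ coef                                     # contrast estimate per gene
    se = np.sqrt(sigma2 * (contrast @ xtx_inv @ contrast))
    return est, est / se

# Simulated example (all values are made up).
rng = np.random.default_rng(1)
n_genes, per_cell = 50, 5
# Cells: 0 = G1.Diseased, 1 = G1.Normal, 2 = G2.Diseased, 3 = G2.Normal
cell = np.repeat(np.arange(4), per_cell)
age = rng.normal(60.0, 10.0, size=cell.size)
age_c = age - age.mean()                                      # centred covariate

# One indicator column per cell, plus the covariate.
design = np.column_stack([np.eye(4)[cell], age_c])

# Every gene depends on age; only gene 0 has a true interaction effect.
expr = 8.0 + 0.05 * age_c + rng.normal(0.0, 0.1, size=(n_genes, cell.size))
expr[0, cell == 0] += 2.0

# (G1.D - G1.N) - (G2.D - G2.N), with zero weight on the age coefficient.
contrast = np.array([1.0, -1.0, -1.0, 1.0, 0.0])
est, t = contrast_t(expr, design, contrast)
print("top gene by |t|:", int(np.argmax(np.abs(t))))
```

Because age is in the design matrix, its effect is absorbed by its own coefficient and does not inflate the residual variance used for the contrast test. With paired samples, a patient term (or a blocking approach) would normally be needed as well.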

updated 2.4 years ago by Ram 43k • written 9.8 years ago by Sean Davis 26k