I understand that an important aspect of eQTL analysis is accounting for confounding variation in the expression data, since many factors can affect gene expression, including sex, age, and lifestyle. Including these factors as covariates in the model should therefore increase the chances of identifying genuine eQTL effects.
I have RNA-seq data for approximately 100 individuals, along with corresponding genotype data for approximately 600,000 SNPs. I'd like to use MatrixeQTL to do an eQTL analysis using this data. I'm also in the fortunate position of having a lot of clinical data for these subjects, covering the usual suspects (age, gender, ethnicity etc.) and much more. All in all, the clinical data comprises 200 fields. However, as is usually the case with clinical data, much of it is incomplete (i.e. contains NAs).
Any number of these fields could be influencing gene expression, but I'm unsure how best to account for them in the analysis. Do I include all of them as covariates? Only the "obvious" ones (age, sex etc.)? Or is a better approach to use principal components of the expression data as covariates?
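To make the last option concrete, this is roughly what I have in mind. It is only a minimal sketch in Python with NumPy, not anything tied to MatrixeQTL itself. It assumes a genes x samples expression matrix with no missing values; the function name `expression_pcs`, the synthetic data, and the choice of 5 components are placeholders of mine, not a recommendation.

```python
import numpy as np


def expression_pcs(expr, n_pcs):
    """Return the top n_pcs principal components of a genes x samples matrix.

    The result has shape (n_pcs, n_samples): one row per component,
    one column per sample.
    """
    expr = np.asarray(expr, dtype=float)
    # Centre each gene (row) across samples so the components describe
    # variation between samples rather than differences in mean expression.
    centred = expr - expr.mean(axis=1, keepdims=True)
    # Thin SVD: the rows of vt are orthonormal directions in sample space.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    # Scale each direction by its singular value to get per-sample scores.
    return s[:n_pcs, None] * vt[:n_pcs]


# Synthetic stand-in for real data (hypothetical): 500 genes, 100 samples,
# with one hidden two-level confounder that shifts expression of every gene.
rng = np.random.default_rng(0)
n_genes, n_samples = 500, 100
batch = rng.integers(0, 2, size=n_samples)
gene_effect = rng.normal(size=n_genes)
expr = rng.normal(size=(n_genes, n_samples)) + np.outer(gene_effect, batch)

pcs = expression_pcs(expr, n_pcs=5)
print(pcs.shape)
```

The rows of `pcs` would then be added as covariates alongside age and sex, with one row per covariate and one column per sample, which is my understanding of the layout MatrixeQTL expects. How many components to keep is part of what I am asking.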
Any guidance would be appreciated.