Which covariates and factors to include in model matrix
2
0
Entering edit mode
2.5 years ago
Barista ▴ 40

Hi all,

I am running a human whole blood transciptomics experiment where two groups of patients are compared. The two groups consist of patients with two different disease. The goal of the experiment is to get insight in the different profiles of gene expression between the two groups. Now I have read the EdgeR manual and other documentation online as on Pubmed, but I am still not sure how I can construct a valid design matrix.

Of course I have included the conditions in the design matrix, but I am wondering which other factors or covariates I should correct for. It seems to be that you should correct for factors/covariates that could confound gene expression. In online tutorials and articles these are mostly things as batch effects, specific time points or different treatment conditions. But the question I keep asking myself is, do I have to correct for covariates such as age, white blood cell differentiation, other laboratory measurments (such as NTproBNP, kidney function etc. etc.) or factors such as sex and comorbities (so for example hypertension, diabetes, COPD etc.).

I have constructed the model matrix with the model.matrix function such as:

design <- model.matrix(~ 0 + types, data = y$samples)

Also I would like to know if I have to include an intercept term when including a covariate, so for a basic example:

design <- model.matrix(~types + age + sex + COPD (and so on...), data = y$samples)

or is the right way:

design <- model.matrix(~0 +types + age, data = y$samples)
Model EdgeR matrix design • 3.3k views
ADD COMMENT
0
Entering edit mode

Any thoughts or comments?

ADD REPLY
4
Entering edit mode
2.4 years ago

You can start by doing PCA, and seeing if you can determine what variable drives the first 2 or 3 principal components. Your ability to account for more than that is going to depend on how many samples you have. If you have only 8, you probably won't be able to model anything more than disease status. With 100, you could probably include more things, like age, sex, other lab results. But of those other things aren't visible in the PCA, they won't matter much.

ADD COMMENT
0
Entering edit mode

Allright, this is some practical pragmatic advice that really helps, thank you for that! I have 539 samples and as by definition the most variation is explained by the first 2 PC's, in total this relates to 30%. So if I understand correctly I should "PCA" the factors/covariates that I think drive the expression and act as a "confounder". For factors this should be easy, just colour the PCA according to the groups of the factor, but how can I do this for continuous variables? Should I factor them by dividing the groups in 4 quantiles for example and then repeat the same process as for factors?

Also am I struggling whether I should correct for blood differentiation (neutrophils, basophils, lymphocytes etc.). This as for bulk RNA-seq it could be important to correct for the cell composition of the tissue that is samples, which in my case is whole blood. I already experimented with adding log(cellcounts +1) to the design matrix for all the cells and this reduced my DE genes from 1400 to around 50. How can I do the PCA trick for the cells counts? As said before, should I make groups based on quantiles of the log(cellcount+1) data and then make a PCA coloured based on which group (quantile) a sample belongs in?

Anyway, many thanks for your help already!

ADD REPLY
0
Entering edit mode

For continous variables you can plot a scatter of PC1 vs confounder, PC1 vs counfouder etc. Or you can use a continous colour scale - this would be automatic in ggplot if you included a continuous variable as a col asthetic. I'm sure you can do it in PCAtools etc.

ADD REPLY
0
Entering edit mode

Do not fart around with the counts that you give to DESeq. It wants raw counts, so stop logging them or adding pseudocounts.

ADD REPLY
2
Entering edit mode

I think he is talking about using log(cellcounts+1) - that is the histology of the sample (what % neurophils, what % basephils, what % lymphocytes etc) as a covariat in the model, which should in principle be fine, not passing it in as gene expression counts for the counts matrix.

ADD REPLY
0
Entering edit mode

Yes indeed, I meant including log(cellcounts+1) in the design matrix as covariate.

Allright I have some practical tools to start investigating which factors and which covariates could be of importance to include in the model, thanks a lot for helping me out on this one, this gives some direction on which way to go.

ADD REPLY
3
Entering edit mode
2.4 years ago

Its also difficult to decide what covariates to include and which not to. In general correcting uneccesarily for covariates costs you power becuase the model has more parameters to estimate. One might correct for covariates for two reasons:

1) Avoid false positives due to changes actaully being associated with a covariate, rather than the condition of interest 2) Avoid false negatives due to the uncorrected covariates increasing the heterogeneity.

It is also unlikely that uncorrected covariates will cause false positives as long as the ages/co-morbilities are balanced between the disease conditions. If they are not, then you probably need to talk to an epidemiologist because I forsee that there are all sorts of issues to be addressed in such a design. In general, I don't think that normal frameworks we use for RNAseq DE are well suited for causal inference. That is, they are good as saying "genes different between these two groups of samples that are from different conditions", but less good at saying "gene differ between these two groups of samples BECAUSE they are from different conditions". My understanding is that they were designed with experimental data (where most confounders are balanced between conditions via experimental design), rather than observational data (where they are not) in mind. Related to that is the fact that the models we use are generally "fixed-effects" models, but I think (although I'm sort of straying beyond my statistical comfort zone) that things observed after the fact are "random-effects". Limma can handle random effects to some extent via its duplicateCorrelation function, but using that in this case would be beyond me.

In the case of avoiding false negatives, you'd want to know that the covariates in question really were inducing extra hetrogenity before correcting for them, because as I said before, correcting for unneccessary covariates reduces power, while correcting for the right ones increases power. You might like to look at something like a PCA, or and MDS or heirarchical clustering of the data to see if those covariates seem to be driving sample sepeartion. If they are, thats a good arguement for correcting for them. If they are not, it might reduce rather than increase power.

ADD COMMENT
0
Entering edit mode

Oh, and on the "include the interscept or not" question - it doesn't really matter, it just changes the way you extract the contasts of interest at the end.

ADD REPLY
0
Entering edit mode

Dear Sudbery, thanks a lot for a bit of background information why to correct or not to correct. I could imagine that my design is indeed not 'balancing out' the age/comorbidities. To give some more background info, my cohort consists of 539 patients (divided into two groups based on a condition) selected from a larger prospective observational cohort. The patients were not matched on age, sex or comorbidites. Furthermore I am doing bulkRNAseq of whole blood, so as also said in the comment to swbarnes I am inclined to correct for cell differentiation. As this is such a heterogeneous cohort doing PCA or heatmaps gives no clustering at all based on the condition that is used to divide the groups.

ADD REPLY
0
Entering edit mode

At this point, given the size of your dataset and the likely cost of the experiment you've just described, I really would recommend consulting with a proper statistician or epidemiologist. I'd worry about inducing things like selection bias and collider bias. I'm pretty confident that thigns like cell type counts should be included as random effects rather than fixed effects. I think limma can do this, but you'll probably need Gordon Smyth (edgeR author) to explain how. I think he is pretty good at getting to thing pretty quickly on the bioconductor support site. He does come round here on occassion and will probably pick up this question at some point, but I think he is on support.bioconductor more often.

ADD REPLY
0
Entering edit mode

I totally agree with you that a statistician or epidemiologist or bioinformatician should be included. To give some context, I am a medicine student which started a PhD in the beginning of this year and my supervisor with bioinformatics knowledge decided to quit a couple of months ago. We currently are looking for a substitute supervisor, but this is work in progress. And as some progress on the subject is expected I am trying one way or the other to figure out how to do thing myself, so as said, a lot of thanks for your feedback and I will try to consult the bioconductor forum as well!

ADD REPLY
0
Entering edit mode

Just to be sure - a statistician/epi is not the same as a bioinfomatician. I'm a bioinformatician, and I know enough stats to get by on a day to day basis, but also enough to know when to contanct a real statistician.

ADD REPLY
0
Entering edit mode

i.sudbery, I can't stress it enough, many, many thanks for your help. Knowing which people to reach out to is of great value!

ADD REPLY

Login before adding your answer.

Traffic: 1604 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6