I have rnaseq data generated from brain tissues of 100 individuals, half of them were affected by a disease. All the samples were mix of two different brain regions, obtained from 4 different providers. I am trying to run DESeq2 to identify differentially expressed genes using the model like: ~age + sex + tissue + PMI + condition, where PMI is post-mortem interval. My colData looks like this:
condition age sex tissue PMI RIN brain_bank A016-93 disease 79 M FCBA89 48.00 6.400000 LND A032-95 disease 71 F FCBA89 65.00 6.550000 LND A046-94 disease 70 F FCBA89 100.00 5.400000 LND AN00216 disease 63 M FCBA89 26.16 6.150000 KBC AN02738 disease 85 F FCBA89 12.80 6.200000 KBC SD001-07 disease 37 F IF 46.00 6.200000 MII . . . AN03019 control 47 F FCBA89 31.85 5.300000 KBC SD004-10 control 50 M IF 63.00 5.400000 MII A046-91 control 41 M IF 49.00 5.450000 LND
Here condition, sex, tissue, brain_bank are as factors and age, RIN, PMI are numeric variables. The RIN values are averaged values because all the samples were sequenced in duplicates.
I want to include brain_bank as a covariate in the model but when I this: ~age + sex + tissue + brain_bank + PMI + condition, it gives me the following error:
the design formula contains a numeric variable with integer values, specifying a model with increasing fold change for higher values. did you mean for this to be a factor? if so, first convert this variable to a factor using the factor() function Show Traceback Rerun with Debug Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified. One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed.
I have seen this error several times but being novice in statistics I am never able to understand this error. If anyone can suggest a good reading material about it, would be great.
Anyways following are my questions:
1) Is there some way to incorporate brain_bank as a covariate in the analysis or I should simply go for the first model?
2) Should I use mean RIN values in the model to account for the effect? Any other suggestions?
3) Not sure if this should asked as a separate question. I also want to run DEXseq in order to find diff. exon usage between cases and controls. So can I use the following model:
~sample + exon + exon:condition + age:exon + sex:exon + tissue:exon + PMI:exon ## Full formula ~sample + exon + age:exon + sex:exon + tissue:exon + PMI:exon ## Reduced formula
Thanks in advance for helping.