First of all, I should mention that I'm new to RNA-seq analysis and the DESeq2 package. I also have only (very) basic knowledge of statistics, so my apologies if I'm asking naive questions :)
We would like to analyse different cell populations that we isolated from different samples/environments (blood, ascites, tumor) from different patients. RNA sequencing was done in bulk. Because these data were generated in a collaboration between several research groups, not all the cells were isolated in the same lab. I would of course like to account for this parameter.
The idea behind my design is the following: because I expect differences between cell types (of course) and between conditions (the environment), I've created a new column in my annotation object that combines (with paste0) the cell_type and condition columns. In brief, I will consider "gMDSC from blood" a different cell population than "gMDSC from ascites".
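A minimal sketch of that step, assuming the columns are named cell_type and cond as in the table below. The three rows are a toy subset for illustration, and the "_" separator is inferred from the group values shown there:

```r
# Toy annotation table: three sample names taken from the table below,
# used here only to illustrate the paste0() step.
annot <- data.frame(
  cell_type = c("gMDSC", "gMDSC", "gMDSC"),
  cond      = c("Cancer_Blood", "Ascites", "Tumor"),
  row.names = c("CA.gMDSC.Blood", "DE.gMDSC.Ascites", "NO.gMDSC.Tumor")
)

# Combine cell type and condition into a single factor, one level per
# cell population / environment pair.
annot$group <- factor(paste0(annot$cell_type, "_", annot$cond))

levels(annot$group)
```

Storing the combined column as a factor keeps it ready for use in a design formula.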
Here's an example of my annotation data frame:
                  cell_type  cond           origin  group
CA.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
DE.gMDSC.Ascites  gMDSC      Ascites        1       gMDSC_Ascites
DE.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
DO.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
FR.gMDSC.Ascites  gMDSC      Ascites        1       gMDSC_Ascites
FR.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
FR.gMDSC.Spleen   gMDSC      Cancer_Spleen  1       gMDSC_Cancer_Spleen
KD.gMDSC.Ascites  gMDSC      Ascites        1       gMDSC_Ascites
KD.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
NO.gMDSC.Ascites  gMDSC      Ascites        1       gMDSC_Ascites
NO.gMDSC.Tumor    gMDSC      Tumor          1       gMDSC_Tumor
ON.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
ON.gMDSC.Tumor    gMDSC      Tumor          1       gMDSC_Tumor
RE.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
RE.gMDSC.Tumor    gMDSC      Tumor          1       gMDSC_Tumor
RI.gMDSC.Blood    gMDSC      Cancer_Blood   1       gMDSC_Cancer_Blood
RI.gMDSC.Tumor    gMDSC      Tumor          1       gMDSC_Tumor
SH.gMDSC.Tumor    gMDSC      Tumor          1       gMDSC_Tumor
TI.gMDSC.Ascites  gMDSC      Ascites        1       gMDSC_Ascites
TI.gMDSC.Tumor    gMDSC      Tumor          1       gMDSC_Tumor
A01.gMDSC         gMDSC      Ascites        2       gMDSC_Ascites
A03.gMDSC         gMDSC      Ascites        2       gMDSC_Ascites
. . .
Sample names are used as row names. 1, 2, 3 and 4 are the four levels of my "origin" factor and correspond to the different research groups that isolated the cells.
The way I understand the DESeq2 design formula, you put the factor you want to use for the comparison last, and the factors you want to "control" for first. I take "control" here to mean "accounting for the variability due to this factor when testing for DEGs on the factor of interest".
Here was my formula:
dds <- DESeqDataSetFromMatrix(countData = cnt, colData = annot, design = ~ origin + group)
Unfortunately, I got this error message:
"Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified. One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed. Please read the vignette section 'Model matrix not full rank': vignette('DESeq2')"
If I remove "origin" from my design formula, the script runs fine. But I feel that I'm missing something quite important there.
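For reference, one situation that produces this kind of error is when every level of one factor occurs within only a single level of the other. The sketch below uses base R only and a made-up toy table (the names toy, A, B, C and D are placeholders, not my real data) to show how the rank of the model matrix can be compared with its number of columns:

```r
# Toy design: groups A and B come only from origin 1, groups C and D only
# from origin 2, so the origin effect cannot be separated from the group
# effects.
toy <- data.frame(
  origin = factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
  group  = factor(c("A", "A", "B", "B", "C", "C", "D", "D"))
)

# Build the model matrix for the same kind of formula as above.
mm <- model.matrix(~ origin + group, data = toy)

n_coef  <- ncol(mm)       # number of coefficients the model tries to fit
mm_rank <- qr(mm)$rank    # number of linearly independent columns

# The design is full rank only when these two numbers are equal.
is_full_rank <- (mm_rank == n_coef)
is_full_rank
```

Here the column for origin 2 equals the sum of the columns for groups C and D, which is the kind of linear combination the error message describes. Running the same check on the real annotation table would show whether that is what is happening in my case.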
So I'm quite lost here... Am I going in the right direction for this kind of analysis (comparing cell populations), or am I completely wrong?
Thanks in advance for your help, and sorry if I forgot to include some important information in the thread. Do not hesitate to ask for it :)
Your origin column appears to encode the same info as the group column, doesn't it?
Sorry, ignore that comment, I was confused by the alignment in your data frame.
Is there a level of "group" that all come from a single origin, or a research centre that only provided samples of a single type?
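A quick way to check this is to cross-tabulate the two columns. This is only a sketch on a made-up toy table (the level "other_group" and the third origin are placeholders, not taken from your data); on your own object you would use the origin and group columns of your annotation data frame instead:

```r
# Toy annotation with a deliberately problematic origin 3 that supplies a
# single group found nowhere else.
toy <- data.frame(
  origin = factor(c(1, 1, 1, 2, 2, 3)),
  group  = factor(c("gMDSC_Ascites", "gMDSC_Tumor", "gMDSC_Cancer_Blood",
                    "gMDSC_Ascites", "gMDSC_Ascites", "other_group"))
)

# Rows are origins, columns are groups, cells are sample counts.
tab <- table(toy$origin, toy$group)
tab

# Number of origins in which each group appears.
origins_per_group <- colSums(tab > 0)

# Number of distinct groups supplied by each origin.
groups_per_origin <- rowSums(tab > 0)

origins_per_group
groups_per_origin
```

Groups that appear in a single origin, and origins that supply a single group, are the first places to look, although they are not the only way for a design to lose rank.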
Hi, thank you for your time
No, for each level of "origin", there are at least two levels of "group" :)