My lab has differentiated iPSC lines, and I need to do a bioinformatic analysis to understand how close they are to the real organ made of these cells. To do this, I've gathered publicly available RNA-seq data from this tissue, both fetal and adult. I'm now doing some exploratory analysis with PCA and dendrograms, but I'm struggling with a couple of issues.
- Constructing the DESeqDataSet
My coldata table (reduced in number of samples) looks like this:
ID        Study   Stage
Sample1   Lab     IPSCs
Sample2   Lab     IPSCs
Sample3   Lab     IPSCs
Sample4   Study1  Fetal
Sample5   Study1  Fetal
Sample6   Study2  Fetal
Sample7   Study2  Fetal
Sample8   Study2  Fetal
Sample9   Study2  Fetal
Sample10  Study2  Fetal
Sample11  Study2  Fetal
Sample12  Study3  Adult
Sample13  Study3  Adult
Sample14  Study3  Adult
Sample15  Study4  Adult
Sample16  Study4  Adult
Sample17  Study4  Adult
Sample18  Study4  Adult
Sample19  Study4  Adult
Sample20  Study4  Adult
I thought the best approach would be to test for Stage (iPSCs/fetal/adult) while controlling for the effect of Study (the fact that the samples come from different experiments):
dds <- DESeqDataSetFromMatrix(countdata, coldata, design= ~ Study + Stage)
However, after trying and retrying in different ways, I always run into this error:
Error in checkFullRank(modelMatrix): the model matrix is not full rank, so the model cannot be fit as specified. One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed.
I don't see any linear combination. I thought the problem was our own samples (since they are the only iPSCs), but I've tried running this without them and the error persists, so I think I'm really missing something. I've read probably every single post from people facing the same problem, and also the tool's vignette, but I still can't figure out what's wrong with my design or how I can solve the issue.
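For reference, here is a minimal base-R sketch of how the rank of this design could be checked outside DESeq2. It assumes the reduced coldata shown above, rebuilt by hand; the names mm, n_cols and mm_rank are only illustrative and are not part of DESeq2. If the rank comes out lower than the number of columns, that would suggest that Stage is fully determined by Study (each study contains samples from a single stage), which may be the linear combination the error refers to.

```r
# Rebuild the reduced coldata from the table above (sample counts per row group).
coldata <- data.frame(
  ID    = paste0("Sample", 1:20),
  Study = factor(rep(c("Lab", "Study1", "Study2", "Study3", "Study4"),
                     times = c(3, 2, 6, 3, 6))),
  Stage = factor(rep(c("IPSCs", "Fetal", "Adult"),
                     times = c(3, 8, 9)))
)

# Cross-tabulate Study against Stage: a study that appears in only one
# Stage column is nested within that stage.
print(table(coldata$Study, coldata$Stage))

# Build the same kind of model matrix that the design ~ Study + Stage implies.
mm <- model.matrix(~ Study + Stage, data = coldata)

n_cols  <- ncol(mm)       # number of coefficients the design asks for
mm_rank <- qr(mm)$rank    # number of linearly independent columns

cat("columns:", n_cols, " rank:", mm_rank, "\n")
```

A rank smaller than the column count means some coefficients cannot be estimated separately. Dropping the iPSC rows from coldata and repeating the check (after droplevels()) is one way to see whether the remaining fetal and adult studies have the same problem.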
- Accounting for batch effects for visualization purposes
I'm guessing limma::removeBatchEffect with Study as the batch would do the trick, but I would appreciate any hints on this topic too. Should I try sva to see whether there are any other batch effects I should take into account?
I'm really sorry if these are really basic questions. I'm very new to bioinformatics and have been trying to find my way on my own, but sometimes I get stuck, especially with anything involving statistics, because I lack a foundation in it.
I really appreciate any input you can give me. Thank you so much in advance!