I have been learning RNA-seq analysis on my own, but my lack of experience is catching up with me. I have run into a problem that I haven't been able to resolve, and I am hoping the community can offer some insight. Before I get to the question, here is some background.
My lab has come up with a novel cell transplantation paradigm (into mouse brain), and I am working on comparing our transplanted cells to their cultured stem-cell progenitors, endogenous in vivo human samples, and cultured human samples. However, the in vivo controls require human brain samples, which are not easy to come by. So far I have been able to sequence two human samples myself, but I have recently turned to a previously generated data set (from a different lab) to bolster that number. This other data set also contains the cultured human samples, which are useful for us to compare to but not absolutely necessary. Here is a simplified version of the design:
I know it is not a great setup, but it is what I have to work with for the time being, since no one else has data on what we are doing. Anyway, all of the FASTQ files have been pre-processed and aligned identically, and I am using DESeq2 with tximport (kallisto output summarized to gene level), followed by the variance stabilizing transformation, to normalize everything. On the resulting PCA, my in-house samples do not cluster with the previously published human samples, which suggests a batch effect.
To correct this, one of our collaborators recommended the following solution:
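library(DESeq2)
# txi is the tximport result; samples holds the Lab and Cell_Type columns
dds <- DESeqDataSetFromTximport(txi, colData=samples, design=~Lab + Cell_Type)
dds <- DESeq(dds)
vst <- varianceStabilizingTransformation(dds, blind=TRUE)
# remove the Lab effect from the transformed values (for visualization only)
assay(vst) <- limma::removeBatchEffect(assay(vst), vst$Lab)
plotPCA(vst, intgroup=c("Cell_Type", "Lab"))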
Resulting in the following PCA:
While this definitely moves things around and shifts some of the PC2 variance to PC1 (which is good for our argument), I can't help but feel like this is an inappropriate application of the batch correction, considering that the experimental design is far from being "full rank." Which brings me to my actual questions:
1) Is this attempt at batch correction appropriate?
2) If not, is there a better method or is waiting for more human samples to run myself the only valid option?
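On the "full rank" concern, here is a minimal sketch of how the overlap between Lab and Cell_Type could be checked in base R. The sample sheet below is made up for illustration (the labels and counts are assumptions, not my real design); the idea is only to show which cell types appear in both labs, since those are the only samples that can inform the Lab coefficient.

```r
# Hypothetical sample sheet; labels and counts are illustrative only.
samples <- data.frame(
  Lab = factor(c("InHouse", "InHouse", "InHouse", "InHouse",
                 "Published", "Published", "Published", "Published")),
  Cell_Type = factor(c("Transplant", "Progenitor", "InVivo", "InVivo",
                       "InVivo", "InVivo", "Cultured", "Cultured"))
)

# Cross-tabulate batch against condition; zeros mark combinations never observed.
design_counts <- table(samples$Lab, samples$Cell_Type)
print(design_counts)

# Build the additive design matrix and compare its rank with its column count.
design <- model.matrix(~ Lab + Cell_Type, data = samples)
design_rank <- qr(design)$rank
is_full_rank <- design_rank == ncol(design)

# Cell types present in more than one lab are the only ones that
# carry information about the Lab effect.
shared_types <- colnames(design_counts)[colSums(design_counts > 0) > 1]
print(is_full_rank)
print(shared_types)
```

In this made-up layout the design matrix is technically full rank, but the Lab effect is estimated from a single shared cell type, which is the situation I am worried about.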
I have been scouring the forums (mainly via Google searches) and have been unable to find an answer that directly applies to this situation. That said, I apologize if this question turns out to be a duplicate because I overlooked something. Thank you all in advance for any answers, and retroactively for all of the other posts I have read that have gotten me this far.