Question

RNA seq help: limma remove batch effect without full rank design

0

Entering edit mode

5.6 years ago

jhasselm • 0

Hello Biostars,

I have been learning RNA seq analysis on my own but my lack of experience is catching up with me. I have run into a problem that I haven't been able to reconcile and I am hoping the community can offer me some insight. But before I get to the question, here is some background.

My lab has come up with a novel cell transplantation paradigm (into mouse brain) and I am working on comparing our transplanted cells to their cultured stem-cell progenitors, endogenous in-vivo human samples, and cultured human samples. However, I need human brain samples to get the in vivo controls and the samples are not easy to come by. So far, I have been able to sequence two human samples myself, but I have recently turned to comparing to a previously generated data set (done by a different lab) to bolster that number. This other data set also contained the cultured human samples, which is useful for us to compare to, but not absolutely necessary. Here is a simplified version of the design:

I know it is not a great setup but it is what I have to work with for the time being since no one else has data on what we are doing. Anyways, all of the Fastq files have been pre-processed/aligned identically and I am using DESeq2 with tximport (Kallisto summarized to gene level) followed by the variance stabilizing transformation to normalize everything. This results in my in house samples not lining up with the previously published human samples by PCA, suggesting a batch effect.

To correct this, one of our collaborators recommended the following solution:

dds <- DESeqDataSetFromTximport(txi, colData=samples, design=~Lab + Cell_Type)
dds <- DESeq(dds)
vst <- varianceStabilizingTransformation(dds, blind=TRUE)
assay(vst) <- limma::removeBatchEffect(assay(vst), vst$Lab)
plotPCA(vst, intGroup="Cell_Type", "Lab")

Resulting in the following PCA:

While this definitely moves things around and shifts some of the PC2 variance to PC1 (which is good for our argument), I can't help but feel like this is an inappropriate application of the batch correction, considering that the experimental design is far from being "full rank." Which brings me to my actual questions:

1) Is this attempt at batch correction appropriate?

2) If not, is there a better method or is waiting for more human samples to run myself the only valid option?

I have been scouring the forums (mainly via Google searches) and have been unable to find an answer that directly applies to this situation. That being said, I apologize if this question ends up being a duplicate because I overlooked something. And thank you all in advance for any answers and retroactively for all of the other posts I have read which have gotten me this far.

Gratefully,

Jonny

RNA-Seq limma batch-effect r • 3.4k views

ADD COMMENT • link updated 5.4 years ago by zx8754 11k • written 5.6 years ago by jhasselm • 0

0

Entering edit mode

"1) Is this attempt at batch correction appropriate?

Perhaps yes to be on the safer side..you can use SVA

ADD REPLY • link 5.6 years ago by 1769mkc ★ 1.2k

score 0 · Answer 1 · 2018-10-03

0

Entering edit mode

5.6 years ago

Kevin Blighe 87k

Jonny, did you happen to read my answer here? - Batch effects : ComBat or removebatcheffects (limma package) ?

You could equally try vst <- varianceStabilizingTransformation(dds, blind=FALSE) and see how that affects the resulting PCA bi-plot. Read in my other answer the difference between blind = TRUE / FALSE

Kevin

ADD COMMENT • link 5.6 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi Kevin,

Thanks for the response! I have read that post and I was in the process of reading that article but it fell by the wayside as I went deeper into the rabbit hole that is Google. I finished reading it and the main takeaway for my situation seems to be that, with such an unbalanced design (the cultured groups are not directly comparable so we have 4 groups total with only one overlapping), batch correction will only partially remove the batch effect and has the potential to introduce false differences. In the case of PCA, this could increase variance between some clusters thereby pushing other clusters closer together, right? I fear this is happening in our plot which is falsely making our transplants cluster closer to their in vivo counterparts.

As for using blind=TRUE vs blind=FALSE, I have tried both methods with little change to the batch effect. I believe you have the code posted for the bi-plots on here somewhere but I don't currently have the snippet in my script so I am including PC1 vs PC2 plots for both. I hope these are sufficient.

I'm not sure how to use this information to decide how to best move forward so I'm open to suggestions.

ADD REPLY • link 5.6 years ago by jhasselm • 0