In my bulk RNA-seq dataset, I noticed a few outliers on MDS plot and on further investigation found that these samples originated from one extraction batch (extbatch) 7.
Fig 1: MDS plot showing outliers and that they belong to extbatch 7.
Most of the samples from batch 7 and even batch 8 had lower RNA yield than other samples.
Fig 2: Mean and SD of RNA yield over extraction batches.
So now i am trying to correct this. I tried
sva::ComBat() to correct for this extbatch and I ended up with negative values in my data. For comparison, I tried correction using
scran::mnnCorrect(). This also produced negative values.
scran::mnnCorrect(clist[],clist[],clist[],clist[],k=4) sva::ComBat(dat=data,batch=extbatch,mod=model.matrix(~family+condition,meta), par.prior=TRUE,prior.plot=TRUE)
Fig 3: Distribution of raw counts, counts after scran correction and combat correction. X-axis are individual samples. Y-axis are counts forced to limits -100K and 100K. Colours denote extraction batches. Red is the offending batch 7 with mostly bad samples. See appearance of negative values after batch correction.
Now, there is the argument that rather than correcting the batch, include it as a factor in my GLM model (
~extbatch+family+condition). I did that and found 2 DEGs. If I remove samples from this batch 7 from my data altogether and run my model (
~family+condition), I find many DEGs.
So, some of my questions are
- Why do I get negative values and how do I fix that?
- Is it better to discard these offending samples (I lose half my data) rather than fixing it using batch correction.
- Why is the GLM with extbatch failing to produce any DEGs while removing the batch works?
- If anyone has any better suggestions to handle this?