RNA-Seq Batch correction negative values
1
1
Entering edit mode
5.2 years ago
firestar ★ 1.4k

In my bulk RNA-seq dataset, I noticed a few outliers on MDS plot and on further investigation found that these samples originated from one extraction batch (extbatch) 7.

Fig 1: MDS plot showing outliers and that they belong to extbatch 7.

Most of the samples from batch 7 and even batch 8 had lower RNA yield than other samples.

Fig 2: Mean and SD of RNA yield over extraction batches.

So now i am trying to correct this. I tried sva::ComBat() to correct for this extbatch and I ended up with negative values in my data. For comparison, I tried correction using scran::mnnCorrect(). This also produced negative values.

scran::mnnCorrect(clist[[1]],clist[[2]],clist[[3]],clist[[4]],k=4)
sva::ComBat(dat=data,batch=extbatch,mod=model.matrix(~family+condition,meta),
par.prior=TRUE,prior.plot=TRUE)


Fig 3: Distribution of raw counts, counts after scran correction and combat correction. X-axis are individual samples. Y-axis are counts forced to limits -100K and 100K. Colours denote extraction batches. Red is the offending batch 7 with mostly bad samples. See appearance of negative values after batch correction.

Now, there is the argument that rather than correcting the batch, include it as a factor in my GLM model (~extbatch+family+condition). I did that and found 2 DEGs. If I remove samples from this batch 7 from my data altogether and run my model (~family+condition), I find many DEGs.

So, some of my questions are

1. Why do I get negative values and how do I fix that?
2. Is it better to discard these offending samples (I lose half my data) rather than fixing it using batch correction.
3. Why is the GLM with extbatch failing to produce any DEGs while removing the batch works?
4. If anyone has any better suggestions to handle this?
rna-seq combat batch-effect scran sva • 4.6k views
2
Entering edit mode
5.2 years ago

Dear rmf,

My recommendation is to continue along the pursuit of correcting for batch by regarding it as a covariate in modelling. I have heard a few commentators call batch-correction tools like ComBat as too "aggressive" in how they modify your data in the pursuit of correcting for batch.

1. Why do I get negative values and how do I fix that?

ComBat converts low count values to negative values after it has modeled the batch effect - it is literally 'subtracting out' the modeled difference. Not all counts will become negative, of course. I have heard how some people then convert all negative values to 0, which seems reckless to me. I did that in the past for my own data and it distorts the data distribution. I would prefer to keep low counts in my data and not to have to lose them to correct for batch.

2. Is it better to discard these offending samples (I lose half my data) rather than fixing it using batch correction.

No, don't throw in the towel just yet. If you try various things and cannot correct for the batch effect sufficiently, then you could divide your data into 2 and use one for training whilst the other for validation

3. Why is the GLM with extbatch failing to produce any DEGs while removing the batch works?

The distribution of data that you're throwing into the GLM may not be what the GLM is expecting, thus, the modeling fails. Or, the GLM over-adjusts because the batch effect is huge, uneven, or confounding.

4. If anyone has any better suggestions to handle this?

In the past, I had success using DESeq2 and including batch as a factor likely to influence counts, i.e., design=~Batch+Gender In your case, you could possibly create another factor level for Low|Medium|High RNA yield.

Just my own thoughts - I'm confident that others will provide more

Kevin

0
Entering edit mode

Just to add a comment to my own answer: If you choose to go along with DESeq2, then its regularised log transformation (post normalisation) deals very well with low counts.

0
Entering edit mode

You may additionally want to take a look at this thread: Surrogate Variable Analysis for Complex Experimental Design

1
Entering edit mode

I am using DESeq2 although I haven't mentioned it in my post. The MDS plot shown is based on VST from DESeq2. Anyway I can't use VST or RLOG transformed data for DGE in DESeq2. So far, using my real experimental variables doesn't seem to help with the batch effect. Perhaps I should create some new variables (like rna yield h/m/l you mentioned). I tried sva and created a new surrogate variable, but that didn't help.

0
Entering edit mode

It looks like adding a factor for RNA yield >5 and <5 (2 levels) would assist, along with the factor for Ext Batch (4 levels). As you adjust more and more, though, you run the risk of getting criticised if you aimed to publish.

0
Entering edit mode

Thanks for the tips. Do you think it's possible to add rnayield as a continuous variable to the glm model rather than categorising it?

1
Entering edit mode

It's possible but, for DESeq2's model design when you're inputting the raw counts, only factors are allowed. If you supply a continuous variable, it will convert it to factors. If you're doing modelling later on using the normalised counts and glm() or lm(), then, by all means, a continuous variable is fine.

As you begin to normalise and adjust for covariates, I would just ask that you be very careful about the conclusions you draw from the results. If I was in the position, I would consult a Full Professor in Statistics (if possible) just for corroboration.

1
Entering edit mode

It's possible but, for DESeq2's model design when you're inputting the raw counts, only factors are allowed. If you supply a continuous variable, it will convert it to factors.

@Kevin, are you sure?? - As far as I recall, continuous variables are valid in DESeq2's GLM.

Just to tak onto the discussion, I agree that consulting with a statistician is always a good idea if you're unsure. Two pairs of eyes are better than one. An additional avenue of exploration is to voom transform your counts and use Limma's differential expression modelling approach with a contrast matrix.

I also second the notion not to use ComBat, as it's been shown to exaggerate effects of the conditions you supply in your design matrix.

1
Entering edit mode

Thanks for the catch, Andrew! It does appear to accept continuous variable. I'll also quote Michael Love: "If the variable of interest is continuous-valued, then the reported log2 fold change is per unit of change of that variable" (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).

I was sure that in older versions it converted continuous variable to factors. I know that the developers have done an overhaul of the modelling system in DESeq2, though.