Question

RNA-Seq data, batch effect source


6.7 years ago

lu.ne ▴ 70

Hi All,

I am currently trying to normalise some RNA-Seq data. The samples came from two different batches, and on a PCA plot the two batches separate clearly.

As far as I can tell, both batches were processed with the same protocol, and the analyses were performed in the same lab.

I tried several normalisations, including limma's removeBatchEffect function, housekeeping-gene normalisation with RUVg, and adding the batch as a term in the model. In each case, either the separation is still there (with removeBatchEffect) or the result looks completely random. Moreover, using housekeeping genes for normalisation seems to be a subject of some controversy.

Before trying anything else to normalise this dataset, I would like to know where the batch effect comes from (or at least determine whether its cause can be identified) in order to select the best normalisation method. To do so, I fitted a model (using limma in R), treating the two batches as a control/treatment comparison, and extracted the significant GO terms associated with the difference between batches (using the gage function). The terms I obtained relate to either antigens or viral processes.

I have two questions. Does this result mean anything in this situation, and could it point to a specific issue? And is this a suitable method for identifying the source of the difference?

Thank you,

RNA-Seq • 2.1k views

updated 3.1 years ago by Biostar 20 • written 6.7 years ago by lu.ne ▴ 70

score 5 · Accepted Answer · 2017-08-10

Don't bother trying to interpret GO results from something like this. You can get a batch effect just from preparing the same thing on different days, if you let the tubes warm up a bit more or less from one day to the next. If you want to check whether the batch effect is being driven by a couple of genes (so you can exclude them), then just look at the per-gene loadings from prcomp() in R. More likely than not, you have a bunch of genes all contributing a little to this, since what you're seeing is some combination of length and GC bias between the batches (plus other things). You might have a look at the CQN package if this ends up being GC-bias based.
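For anyone wanting to try the prcomp() check, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the toy matrix `expr`, the gene and sample names, and the batch shift applied to the first five genes are all made up, so swap in your own log-scale expression matrix (genes in rows, samples in columns) and batch labels.

```r
set.seed(1)
n_genes   <- 200
n_samples <- 12
batch     <- factor(rep(c("A", "B"), each = n_samples / 2))

# Simulated log-expression matrix: genes in rows, samples in columns (toy data).
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("s", seq_len(n_samples))))

# Hypothetical batch effect: the first five genes are shifted up in batch B.
expr[1:5, batch == "B"] <- expr[1:5, batch == "B"] + 8

# prcomp() expects observations (samples) in rows, so transpose.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Sample projections: check which component separates the batches.
pc1_by_batch <- tapply(pca$x[, "PC1"], batch, mean)

# Gene loadings on that component: large absolute values mark the genes
# that contribute most to the separation.
loadings  <- pca$rotation[, "PC1"]
top_genes <- names(sort(abs(loadings), decreasing = TRUE))[1:5]

# Fraction of the squared PC1 loading carried by those top genes.
# Close to 1: a handful of genes drive the axis. Small: many genes each
# contribute a little.
top_share <- sum(loadings[top_genes]^2) / sum(loadings^2)

print(pc1_by_batch)
print(top_genes)
print(top_share)
```

On real data, if `top_share` stays small however many top genes you take, that fits the "many genes each contributing a little" picture, and excluding individual genes is unlikely to help.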