I am currently trying to normalise some RNA-Seq data. The samples come from two different batches, and on a PCA plot the two batches are clearly separated.
As far as I can tell, the same sample-processing protocol was used for both batches, and the analyses were performed in the same lab.
I tried several normalisation approaches: limma's removeBatchEffect function, housekeeping-gene normalisation with RUVg, and including batch as a term in the model. In each case, either the separation is still there (with removeBatchEffect) or the result looks completely random. Moreover, using housekeeping genes for normalisation seems to be quite controversial.
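For readers unfamiliar with what a linear batch correction does to the data, here is a minimal Python sketch of per-gene batch centering. It is a simplified stand-in for the general idea, not limma's removeBatchEffect implementation and not the code used for this dataset; the function name `center_by_batch` and the expression values are invented for illustration.

```python
from statistics import mean

def center_by_batch(values, batches):
    """Shift each batch so that its mean equals the gene's overall mean.

    values  -- log-expression of one gene, one entry per sample
    batches -- batch label of each sample, same order as values
    """
    overall = mean(values)
    # Mean expression of this gene within each batch.
    batch_means = {
        b: mean(v for v, bb in zip(values, batches) if bb == b)
        for b in set(batches)
    }
    # Remove the batch-specific offset, then restore the overall level.
    return [v - batch_means[b] + overall for v, b in zip(values, batches)]

# Invented log-expression values for one gene across six samples.
gene = [5.0, 5.2, 4.8, 7.0, 7.1, 6.9]
batch = ["A", "A", "A", "B", "B", "B"]
corrected = center_by_batch(gene, batch)
```

After this step both batches share the same mean for the gene, so any separation that remains on a PCA plot would have to come from something other than a simple per-gene shift between batches.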
Before trying anything else to normalise this dataset, I would like to know where the batch effect comes from (or at least determine whether its cause can be identified at all) in order to select the most appropriate normalisation method. To do so, I fitted a model (using limma in R), treating the two batches as a control/treatment contrast, and extracted the significant GO terms associated with the difference between batches (using the gage function). The terms I obtained relate to either antigens or viral processes.
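The "batch as control/treatment" comparison described above can be illustrated with a small Python sketch. It uses a plain Welch t-statistic per gene, which is a simplified substitute for limma's moderated statistics, and it stops before any gene-set step such as gage. The gene names, values and the helper `rank_genes_by_batch` are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Welch t-statistic for the difference in means between x and y."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

def rank_genes_by_batch(expr, batches, ref, other):
    """Rank genes by how strongly they differ between two batches.

    expr    -- dict mapping gene name to a list of log-expression values
    batches -- batch label of each sample, same order as the value lists
    ref     -- label of the batch treated as "control"
    other   -- label of the batch treated as "treatment"
    """
    scores = {}
    for gene, values in expr.items():
        ref_vals = [v for v, b in zip(values, batches) if b == ref]
        other_vals = [v for v, b in zip(values, batches) if b == other]
        scores[gene] = welch_t(other_vals, ref_vals)
    # Largest absolute statistic first.
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Invented data: one gene unaffected by batch, one shifted in batch B.
expr = {
    "gene_stable": [5.0, 5.2, 4.8, 5.1, 4.9, 5.0],
    "gene_shifted": [5.0, 5.2, 4.8, 7.0, 7.1, 6.9],
}
batches = ["A", "A", "A", "B", "B", "B"]
ranked = rank_genes_by_batch(expr, batches, ref="A", other="B")
```

The genes at the top of such a ranking are the ones that would drive any downstream GO enrichment, so inspecting them directly is one way to judge whether the enriched terms reflect a handful of strongly shifted genes or a broad change.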
I have two questions. First, does this result mean anything in this situation, and could it point to a specific issue? Second, is this a suitable method for identifying the source of the difference between batches?