I have some RNA-seq data and I would like to take the results forward to build a network based on expression correlation (WGCNA or similar).
I have several steps that I need to carry out:
- normalise data
- transform data
- subset data (not everything in the dataset is useful for the intended outcome, so I need to extract only the relevant samples)
- remove a known batch effect (I know I can model this for differential expression, but for building a network I think I need to remove it - correct me if I am wrong)
I am using DESeq2 for normalisation and transforming the data with the variance stabilising transform from the same package (as recommended in the WGCNA manual).
I have noticed that the outputs of my exploratory analyses change depending on the order in which I carry out these steps, particularly the PCA plots. In most cases the gross patterns in the data remain intact, but in some cases they do not. My question is: what is the correct order in which to carry out these steps, and why?
Currently, I am loading all of the available samples, normalising, transforming the normalised counts, removing the batch effect from the transformed data, then extracting the samples of interest. General discussion about the order in which these processes should be carried out is welcome, but specifically:
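For concreteness, the current order can be sketched roughly as below. This is only an illustration of my description, assuming a count matrix `counts`, a sample table `coldata` with `batch` and `condition` columns, and a vector `samples_of_interest` (all names hypothetical), and using `limma::removeBatchEffect` as the batch-removal step:

```r
library(DESeq2)
library(limma)

# 1. load all available samples
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ batch + condition)

# 2. normalise (size factors estimated across ALL samples)
dds <- estimateSizeFactors(dds)

# 3. variance stabilising transform of the normalised counts
vsd <- vst(dds, blind = FALSE)

# 4. remove the known batch effect from the transformed data
mat <- removeBatchEffect(assay(vsd), batch = vsd$batch)

# 5. extract the samples of interest only at the end
mat_sub <- mat[, samples_of_interest]
```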
- Would it be more sensible to extract the samples I am interested in first, and then run the downstream steps only on the samples I am interested in? I imagine this would affect the output as the geometric mean across the samples would change.
- I am currently removing the batch effect from the vst-transformed data. Would I be better to remove the batch effect first from the normalised counts and then transform the batch-corrected data?
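The subset-first alternative from the first bullet would look something like this (again a sketch with hypothetical names, not a recommendation): size factors and the vst dispersion trend would then be estimated only from the samples that actually enter the network, and the batch fit would not be influenced by the discarded samples. Passing a `design` matrix to `removeBatchEffect` protects the biological signal of interest while the batch term is regressed out:

```r
library(DESeq2)
library(limma)

# subset FIRST: keep only the samples intended for the network
keep <- coldata$group %in% c("A", "B")   # hypothetical selection criterion
dds  <- DESeqDataSetFromMatrix(countData = counts[, keep],
                               colData   = coldata[keep, ],
                               design    = ~ batch + condition)

# normalise and transform using only the retained samples
dds <- estimateSizeFactors(dds)
vsd <- vst(dds, blind = FALSE)

# remove batch from the vst data, preserving the condition effect
mat <- removeBatchEffect(assay(vsd),
                         batch  = vsd$batch,
                         design = model.matrix(~ condition, colData(vsd)))
```

Note that because vst works on the normalised-count scale, removing batch before the transform would mean subtracting a linear batch fit from data whose variance still depends on the mean, which is part of why the two orders give different PCA outputs.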
Thanks for your help