First, I apologize if this question seems like a duplicate. I've extensively searched and read the previously asked questions, but the differing and sometimes contradictory opinions made it hard for me to reach a final conclusion.
My objective is to generate a list of differentially expressed genes between tumor cells and their healthy counterparts for subsequent analysis. Based on what I have learned so far, I have this analysis pipeline in mind:
- Collect raw (.CEL) data from different experiments run on the same platform (HG-U133_Plus_2).
- Quality control, preprocess, and normalize samples within each experiment separately.
- Combine all of the normalized samples into a single dataset, but keep the batch effect in mind (use ComBat, or include each sample's original experiment as a covariate in the limma design).
- Perform differential gene expression analysis on the combined dataset.
Is this approach valid? Or should I first combine the samples from all experiments into one dataset, and then normalize them together in a single step?
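For what it's worth, the "batch as covariate" idea in the last two steps can be illustrated outside of limma with a toy linear model. This is a NumPy sketch with made-up effect sizes and a deliberately unbalanced design, not the actual limma workflow:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design (hypothetical numbers): two experiments (batches) with an
# unbalanced tumor/normal split, which is where batch confounding bites.
condition = np.array([0] * 8 + [1] * 2 + [0] * 2 + [1] * 8)  # 1 = tumor
batch     = np.array([0] * 10 + [1] * 10)                    # experiment of origin

# Simulated log-expression for one gene: true tumor effect 2.0, batch shift 3.0
y = 5.0 + 2.0 * condition + 3.0 * batch + rng.normal(0.0, 0.5, size=20)

# Design matrix with intercept, condition, and batch as a covariate
X = np.column_stack([np.ones(20), condition, batch])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Same fit without the batch column: the condition estimate absorbs part of
# the batch shift, because tumor samples are concentrated in batch 1.
coef_naive, *_ = np.linalg.lstsq(X[:, :2], y, rcond=None)
```

With batch in the model, `coef[1]` lands near the simulated effect of 2.0, while `coef_naive[1]` is inflated by the confounded batch shift; in limma the same idea is expressed through the design matrix (condition plus batch).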
Thank you for your time.
Regards.
Cool. Generally, you seem to be aware of where the pitfalls may lie with such a procedure, so, that is a good start. I'm only adding this further comment for others who may arrive at this thread:
After you normalise everything together, a check of the box-and-whisker plot will be immediately informative: if the experiments are grossly different, this should be visible on such a plot, as the different experiments will likely not line up at their medians, even after quantile normalisation. For example, if one experiment is brain tissue while the other is skin, these will have grossly different expression profiles, and it would obviously be more appropriate to analyse them separately, even if they are the same chip type.
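To make that check concrete, here is a NumPy sketch (simulated intensities, not real CEL data, and a simplistic stand-in for a proper normalisation pipeline) of quantile normalisation followed by the per-sample median check that a box-and-whisker plot visualises:

```python
import numpy as np

def quantile_normalize(mat):
    """Quantile-normalise columns (samples) of a genes x samples matrix:
    each sample's values are replaced by the mean sorted profile."""
    ranks = np.argsort(np.argsort(mat, axis=0), axis=0)
    return np.sort(mat, axis=0).mean(axis=1)[ranks]

rng = np.random.default_rng(1)
# Two hypothetical experiments with different overall intensity scales
exp1 = rng.normal(7.0, 1.0, size=(1000, 5))
exp2 = rng.normal(9.0, 1.5, size=(1000, 5))
combined = np.hstack([exp1, exp2])

normalized = quantile_normalize(combined)
# After quantile normalisation every sample shares the same distribution,
# so the per-sample medians (the box-plot centre lines) coincide exactly.
medians = np.median(normalized, axis=0)
```

Before normalisation the two experiments' medians sit roughly 2 log-units apart; afterwards they line up, which is exactly what the box plot shows at a glance.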
Thank you for your comment. I've actually learned a lot by reading your posts. One question that crosses my mind: because most normalization methods (for example RMA) share information between arrays, would it not be safer to normalize the data from each experiment separately?
To explain myself better: there is always non-biological variation between samples, which we try to minimize using normalization. Because arrays from different experiments (labs) differ more in these non-biological ways, can't we expect that normalizing all of them together would be an oversimplification and could wash out biological variation in the process? Can we consider normalizing each experiment's samples separately to be the more conservative approach?
Yes, this is why I also said this: "As you go through the analysis, you will learn which procedure is best."
Questions like yours have no single right answer; each one requires a different approach based on many factors, and those cannot all be covered in one 'catch-all' answer.
I see, thank you again for your time. : )