6 months ago by
University College London Cancer Institute
My initial question back to you would be: from where did you recruit the controls, and to which tissues are we referring here? Tissues like blood serum/plasma will show wider variation than others, particularly in metabolomics.
I've been working on metabolomics for the past year in the USA and spent a lot of time looking specifically at the control samples that we had over there. They exhibit very high variability (as does everything in metabolomics!) but, once you normalise their metabolite levels (from m/z ratios), the profiles of even different groups of healthy controls (processed in the same way but in different batches) match very well when looking at natural log counts (and after removing metabolites by the criteria that I mention below). What I'm comparing here are 2 distributions (one in red, the other in blue) for the 2 batches of 15 randomly selected controls:
[Figure: natural log histogram]
[Figure: natural log line plot]
The distribution then gets a bit out of control if you further convert these to Z-scores:
Getting back to the main point: we did not [edit:] re-do the pre-processing / normalisation of the test sample metabolite levels based on the QC sample levels. The QC samples were used purely for identifying problematic metabolites, which we then filtered out of the main data. We specifically applied the following filtering criteria:
Remove metabolites if:
- Levels in QC samples had coefficient of variation (CoV) > 25%
- Levels in QC samples had intraclass correlation (ICC) < 0.4
- Missingness > 10% across cases and controls
- No variability across cases and controls based on interquartile range
Then, individual samples were removed if >10% of their metabolites were missing.
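To make the filtering order concrete, here is a minimal Python sketch of the CoV, missingness, and interquartile range criteria, followed by the per-sample missingness check. It is not the code that we actually ran: the function names, the dictionary layout (metabolite name mapped to a list of levels, with None for a missing value), and the toy numbers are all assumptions for illustration. The ICC criterion is left out because it depends on how your QC replicates are structured.

```python
import statistics

def cov_percent(values):
    """Coefficient of variation in percent: 100 * SD / mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

def missing_fraction(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def iqr(values):
    """Interquartile range of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    return q3 - q1

def filter_metabolites(qc, study, max_cov=25.0, max_missing=0.10):
    """Keep metabolites that pass the CoV, missingness, and IQR criteria.

    qc and study are hypothetical dicts: metabolite -> list of levels.
    """
    kept = {}
    for name, levels in study.items():
        if cov_percent(qc[name]) > max_cov:
            continue  # too variable in the QC samples
        if missing_fraction(levels) > max_missing:
            continue  # too many missing values across cases and controls
        if iqr(levels) == 0:
            continue  # no variability across cases and controls
        kept[name] = levels
    return kept

def filter_samples(study, max_missing=0.10):
    """Return indices of samples with at most max_missing missing metabolites."""
    n_samples = len(next(iter(study.values())))
    keep = []
    for i in range(n_samples):
        frac = sum(levels[i] is None for levels in study.values()) / len(study)
        if frac <= max_missing:
            keep.append(i)
    return keep

# Toy data (invented for illustration only)
qc = {
    "m1": [100, 102, 98, 101],
    "m2": [100, 160, 40, 120],   # CoV well above 25%
    "m3": [50, 51, 49, 50],
    "m4": [10, 10.5, 9.5, 10],
}
study = {
    "m1": [90, 110, 105, 95, 120, 100, 98, 102, 97, 103],
    "m2": [80, 85, 90, 95, 100, 105, 110, 115, 120, 125],
    "m3": [5, None, None, 7, 6, 8, 5, 6, 7, 6],   # 20% missing
    "m4": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],         # no variability
}

kept = filter_metabolites(qc, study)
samples_to_keep = filter_samples(kept)
print(sorted(kept), samples_to_keep)
```

In this toy example only m1 survives the metabolite filters, and all 10 samples are retained because m1 has no missing values.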
For everything else that remained, we converted NAs to half the lowest level, to zero, or imputed with the median level (of each metabolite), depending on the type of downstream analyses.
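As a rough sketch of those three options in Python: the function name `impute` and the `method` labels are hypothetical, and I am assuming here that "half the lowest level" means half the lowest observed level of the same metabolite.

```python
import statistics

def impute(levels, method="half_min"):
    """Replace missing values (None) in one metabolite's levels.

    method: "half_min" (half the lowest observed level of this metabolite),
            "zero", or "median" (median of the observed levels).
    """
    observed = [v for v in levels if v is not None]
    if method == "half_min":
        fill = min(observed) / 2
    elif method == "zero":
        fill = 0.0
    elif method == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown imputation method: {method}")
    return [fill if v is None else v for v in levels]

# Toy example (invented values)
levels = [4.0, None, 10.0, 6.0]
print(impute(levels, "half_min"))
print(impute(levels, "zero"))
print(impute(levels, "median"))
```

Note that zero imputation will not work if you then take logs, so the choice really does depend on what comes next.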
After all of that, your aim should be to get the levels in your cases and controls into an approximately normal distribution and then conduct the differential analysis. I generally found that log transformation followed by conversion to Z-scores worked, followed by independent regression modelling predicting case/control status on a per-metabolite basis. We did not actually use XCMS.
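For a single metabolite, that transform-then-model step could look something like the Python sketch below. It is only an illustration under my own assumptions: the function names and toy values are invented, the logistic model is fitted with a bare-bones Newton-Raphson loop, and in practice you would use a proper statistics package so that you also get standard errors and p-values. You would repeat this independently for each metabolite.

```python
import math
import statistics

def log_zscore(levels):
    """Natural log transform, then convert to Z-scores (mean 0, SD 1)."""
    logged = [math.log(v) for v in levels]
    mu = statistics.mean(logged)
    sd = statistics.stdev(logged)
    return [(x - mu) / sd for x in logged]

def fit_logistic(x, y, n_iter=25):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by Newton-Raphson.

    Returns (b0, b1). Assumes cases and controls overlap in x; with
    perfectly separated groups the estimates do not converge.
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            w = p * (1 - p)
            g0 += yi - p            # gradient w.r.t. intercept
            g1 += (yi - p) * xi     # gradient w.r.t. slope
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Toy data for one metabolite (invented); status 1 = case, 0 = control
levels = [10, 12, 15, 11, 20, 14, 25, 18, 30, 13]
status = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]

z = log_zscore(levels)
b0, b1 = fit_logistic(z, status)
print(f"intercept = {b0:.3f}, log odds ratio per SD = {b1:.3f}")
```

Because the levels are Z-scored, the slope is the change in log odds of being a case per 1 SD increase in the logged metabolite level, which makes the coefficients comparable across metabolites.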
Hope that this helps!