Question: Preprocess QCs in metabolomics workflow together with samples - yes or no?
ab123 wrote, 21 days ago (London):

This is more of a conceptual question.

Let's assume I have 5 QCs and 10 samples (2 groups - Control and Case).

I preprocess them all together in XCMS following the usual steps in order to then define a coefficient of variation across the QCs and throw out features that do not meet a certain threshold here (let's say 30%).
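For concreteness, the CoV filter described here can be sketched as below. This is plain Python purely to illustrate the arithmetic (the actual workflow uses XCMS in R), and the feature names and intensities are invented:

```python
# Sketch of the QC-based CoV filter: keep a feature only if its
# coefficient of variation across the QC injections is <= 30%.
from statistics import mean, stdev

def qc_cov(intensities):
    """Coefficient of variation (sd / mean) of one feature across QCs."""
    return stdev(intensities) / mean(intensities)

# Hypothetical intensities for two features across the 5 QC injections.
qc_table = {
    "feat_001": [1050, 980, 1010, 995, 1020],  # stable across QCs
    "feat_002": [400, 900, 150, 700, 1200],    # highly variable
}

kept = {f for f, vals in qc_table.items() if qc_cov(vals) <= 0.30}
print(kept)  # feat_002's CoV is well above 30%, so only feat_001 survives
```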

Going forward into differential expression now: would you redo preprocessing without the QCs for the 10 samples or would you carry on with the samples and their intensities as determined during that first preprocessing step (i.e. containing the QCs)?

Naturally, I get different intensities when I do preprocess with or without QCs. And to me it makes more sense to not factor the QCs into preprocessing since they would affect my samples' intensities? However, the latter affects which peaks are picked and of course they too vary...

Any takes on that?

Much appreciated! Cheers

Tags: xcms, R, metabolomics
modified 21 days ago by Kevin Blighe • written 21 days ago by ab123
Kevin Blighe wrote, 21 days ago (Europe/Americas):

Hey ab123,

My initial questions back to you would be: from where did you recruit the controls, and which tissue are we referring to here? Tissues like blood serum/plasma show wider variation than others, particularly in metabolomics.

I've been working on metabolomics for the past year in the USA and took a lot of time to specifically look at the control samples that we had over there. They exhibit very high variability (as does everything in metabolomics!), but once you normalise their metabolite levels (derived from m/z ratios), the profiles of even different groups of healthy controls (processed in the same way but in different batches) actually match very well on the natural-log scale (and after removing metabolites by the criteria that I mention below). What I'm comparing here are 2 distributions (one in red, the other blue) from 2 batches of 15 randomly selected controls:

[Figure: natural-log histogram (histloge)]

[Figure: natural-log line/density plot (densLoge)]

The distribution then gets a bit out of control if you further convert these to Z-scores:

[Figure: Z-score density plot (denslogez)]


Getting back to the main point: we did not (edit:) re-do the pre-processing / normalisation of the test samples' metabolite levels based on the QC sample levels. The QC samples were used purely for identifying problematic metabolites, which we then filtered out of the main data. Specifically, we applied the following filtering criteria:

Remove metabolites if:

  • Level in QC samples had coefficient of variation (CoV) > 25%
  • Levels in QC samples had intraclass correlation (ICC) > 0.4
  • Missingness > 10% across cases and controls
  • No variability across cases and controls based on interquartile range (IQR)

Then, individual samples were removed if >10% of their metabolites had missing values.
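The two missingness rules (feature-level, then sample-level) can be sketched like so. This is plain Python for illustration; the sample names, metabolite names, and values are invented, and the 10% thresholds are the ones quoted above:

```python
# Rows = samples, columns = metabolites; None marks a missing value.
data = {
    "sample_1": {"m1": 5.0, "m2": None, "m3": 2.0},
    "sample_2": {"m1": 4.8, "m2": 3.1,  "m3": None},
    "sample_3": {"m1": 5.2, "m2": 3.0,  "m3": 2.1},
}
metabolites = ["m1", "m2", "m3"]

# 1) Drop metabolites missing in >10% of samples.
def missing_frac(m):
    return sum(data[s][m] is None for s in data) / len(data)

kept_mets = [m for m in metabolites if missing_frac(m) <= 0.10]

# 2) Drop samples with >10% of the remaining metabolites missing.
kept_samples = [
    s for s in data
    if sum(data[s][m] is None for m in kept_mets) / len(kept_mets) <= 0.10
]
print(kept_mets, kept_samples)
```

With only three toy samples, a single missing value already means 33% missingness, so m2 and m3 are dropped while every sample survives.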

For everything else that remained, we converted NAs to half the lowest level, to zero, or imputed with the median level (of each metabolite), depending on the type of downstream analyses.
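The NA handling could look something like this (a Python sketch; which strategy applies to which metabolite depends on the downstream analysis, as noted above):

```python
# Replace missing values in one metabolite's vector of sample levels,
# using one of the three strategies described in the text.
from statistics import median

def impute(values, strategy):
    observed = [v for v in values if v is not None]
    if strategy == "half_min":
        fill = min(observed) / 2       # half the lowest observed level
    elif strategy == "zero":
        fill = 0.0
    elif strategy == "median":
        fill = median(observed)        # median of the observed levels
    else:
        raise ValueError(strategy)
    return [fill if v is None else v for v in values]

levels = [4.0, None, 6.0, 8.0]
print(impute(levels, "half_min"))  # None -> 2.0
print(impute(levels, "median"))    # None -> 6.0
```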

After all of that, your aim should be to get the levels in your cases and controls into a normalised distribution and then conduct the differential analysis. I generally found that taking natural logs and then converting to Z-scores worked, followed by independent regression models predicting case/control status on a per-metabolite basis. We did not actually use XCMS.
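The normalisation described here (natural log, then Z-scores) in minimal form. This is a Python sketch with invented intensities; the per-metabolite regression itself would typically be something like glm(status ~ level) in R and is only gestured at in the comment:

```python
# Natural-log transform one metabolite's levels, then standardise
# to Z-scores (mean 0, sd 1).
from math import log
from statistics import mean, stdev

def log_z(values):
    logged = [log(v) for v in values]
    mu, sd = mean(logged), stdev(logged)
    return [(x - mu) / sd for x in logged]

# Hypothetical raw intensities for one metabolite across six samples.
levels = [100.0, 150.0, 90.0, 400.0, 380.0, 120.0]
z = log_z(levels)
print([round(x, 2) for x in z])
# Each metabolite's Z-scores would then serve as the predictor in an
# independent case/control regression model.
```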

Hope that this helps!

Kevin

modified 21 days ago • written 21 days ago by Kevin Blighe

Hi Kevin, thank you for the extensive and informative reply!

As per your question, we are looking at organ tissues here. The QCs are pooled from the samples, and they cluster nicely in PCA, suggesting small instrumental variation.

I'm still a bit confused about the actual preprocessing step. Wouldn't I want my samples to be peak-aligned according to the QCs? If I preprocess all samples together, peak alignment etc. is performed across the 3 groups and the intensities reflect that. If I do it for just the samples, I end up with different features, which makes it difficult to then filter against the QC features. I could technically subject the QCs and the 2-group samples to separate preprocessing, but that leads to slightly different lists of metabolites. I could then try to filter the QC features (CoV > 30%) against the samples? But again, the samples are then no longer aligned according to the features present in the QCs.

Sorry if that sounds confusing; I am confused right now, more so about the actual input steps, I guess.

Your metabolite-removal criteria look sound, but how are they applied?

modified 21 days ago • written 21 days ago by ab123

Hi! No problem. You asked in your original message about re-doing the pre-processing step after the initial filtering, which is something that we didn't do.

For us, all samples (QCs, cases, controls) undergo the initial pre-processing step together for peak identification, m/z ratio calculation, etc. (as you have done), and then we filter out the metabolites/samples that meet the filtering criteria I mentioned above. We then proceed with that same data for downstream testing (minus the QC samples). There are no further pre-processing steps, and the pre-processing is not re-done.

The first 2 QC criteria:

  • Level in QC samples had coefficient of variation (CoV) > 25%
  • Levels in QC samples had intraclass correlation (ICC) > 0.4

Calculate these using the QC samples only; any metabolites that meet these criteria are then removed from all cases and controls.
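Applying the QC-derived blacklist to the case/control data might look like this (a Python sketch; the flagged set would come from the CoV/ICC calculations on the QCs, and all names and values here are invented):

```python
# Metabolites that failed the QC-based criteria (CoV/ICC) get removed
# from every case and control sample.
flagged = {"m2"}  # hypothetical QC failures
case_control = {
    "case_1":    {"m1": 5.0, "m2": 9.9, "m3": 2.0},
    "control_1": {"m1": 4.8, "m2": 0.1, "m3": 2.2},
}
filtered = {
    s: {m: v for m, v in row.items() if m not in flagged}
    for s, row in case_control.items()
}
print(sorted(filtered["case_1"]))  # m2 is gone from every sample
```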

The other criteria:

  • Missingness > 10% across cases and controls
  • No variability across cases and controls based on interquartile range (IQR)

These are only applied to the cases and controls. The 10% cutoff is a bit meaningless in your data, as you only have 10 samples (we had hundreds); a single missing value already puts a metabolite at 10%.

The final one: "Then individual samples were removed if >10% of their metabolites had missingness". If any sample has >10% of its metabolites with missing values, it should be removed from the dataset.

Hope that this clarifies it a bit?

Edit: it is interesting that you get different results when you re-do the pre-processing, but it's also expected, given the wide variation that metabolites exhibit. A lot of the processing methods in this field are still liable to change.

modified 21 days ago • written 21 days ago by Kevin Blighe

Awesome, thank you once again for the very detailed answer. The above definitely solves it for me! I still think it may be an interesting question whether preprocessing with the QCs biases the samples towards the QCs...

written 21 days ago by ab123

Well, we should use the word 'solved' very lightly! With metabolomics, I think it's open game with regard to how the data is processed. Your logic does make sense, i.e., to go back and re-perform the pre-processing step with just the cases/controls (after they've been filtered).

written 21 days ago by Kevin Blighe

"Solves" my question.

written 17 days ago by ab123
Powered by Biostar version 2.3.0