Hello, I am looking for some guidance on metabolomics analysis. I have a data set that has missing values for many of the metabolites, and I am wondering what the best method for dealing with that is. If I limit myself to metabolites with no missing values I drop a huge amount of my data, but I also don't want to just fill in missing values with 0. So any recommendations are welcome.
Where I was based in the USA, many groups were doing metabolomics and each handled missing values differently, which provoked frequent discussion. First, if a metabolite has a high level of missingness, you should probably remove it. For example, we removed metabolites that were missing in more than 10% of samples. Other strategies:
- impute with the median peak intensity for the metabolite prior to any other transformation (fine for univariate analysis)
- impute with half the lowest peak value prior to any other transformation
- replace missing values with zero
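The filtering step and the three strategies above can be sketched in pandas as follows. This is only a minimal illustration: the toy table, the metabolite names and the helper function names are all hypothetical, and it assumes samples are in rows, metabolites are in columns, and the values are untransformed peak intensities.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are samples, columns are metabolites (raw peak intensities).
data = pd.DataFrame(
    {
        "met_a": [10.0, 12.0, np.nan, 14.0, 11.0],    # 20% missing
        "met_b": [np.nan, np.nan, np.nan, 5.0, 6.0],  # 60% missing
        "met_c": [100.0, 80.0, 120.0, 90.0, 110.0],   # complete
    }
)


def drop_sparse(df, max_missing=0.10):
    """Keep metabolites whose fraction of missing samples is at most max_missing."""
    missing_fraction = df.isna().mean(axis=0)
    return df.loc[:, missing_fraction <= max_missing]


def impute_median(df):
    """Fill each metabolite's gaps with the median of its observed values."""
    return df.fillna(df.median(axis=0))


def impute_half_min(df):
    """Fill each metabolite's gaps with half of its lowest observed value."""
    return df.fillna(df.min(axis=0) / 2)


def impute_zero(df):
    """Fill every gap with zero."""
    return df.fillna(0.0)


# The toy table has only five samples, so a looser cutoff than 10% is used here
# purely so that one metabolite with a gap survives the filter.
filtered = drop_sparse(data, max_missing=0.25)

by_median = impute_median(filtered)
by_half_min = impute_half_min(filtered)
by_zero = impute_zero(filtered)
```

Whichever fill is used, it is applied per metabolite after the sparse ones have been dropped, and before any log transform or scaling.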
Note that if your data is on the Z scale, then replacing missing values with zero is actually equivalent to imputing with the metabolite's mean, which is close to the median only when the distribution is roughly symmetric. The scale on which you have your data is important in relation to imputation.
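A quick check of the Z-scale point, on hypothetical skewed values for a single metabolite: zero-filling after Z-scaling lands on the mean of the observed values, which sits away from the median when the data are skewed. The numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed intensities for one metabolite, with one missing sample.
raw = pd.Series([1.0, 2.0, 3.0, 10.0, np.nan])

# pandas skips NaN by default, so these use the four observed values only.
mean = raw.mean()
std = raw.std()

# Route 1: Z-scale first, then replace the missing value with zero.
z_zero_filled = ((raw - mean) / std).fillna(0.0)

# Route 2: impute the mean on the raw scale, then apply the same Z-scaling.
z_of_mean_filled = (raw.fillna(mean) - mean) / std

# Where the median of the observed values falls on the Z scale.
z_of_median = (raw.median() - mean) / std

print(np.allclose(z_zero_filled, z_of_mean_filled))
print(z_of_median)
```

The two routes give the same values, while the median maps to a negative Z score for this right-skewed example.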
What does my experience tell me? It doesn't really matter much which of these you use; in my experience the key metabolites come up regardless. Also, in metabolomics, 'missingness' can be caused by different things:
- the metabolite is simply not present in the sample
- the signal failed instrument QC or remained in the background noise
- metabolite decayed into some other unknown metabolite and the signal is thus lost forever in that sample
- et cetera