Question

Mass-spectrometry proteomics

0

Entering edit mode

26 days ago

Nasim Gandomdoust • 0

I have mass-spectrometry-based proteome data of 6 control and 3 treated sample. There are random number of valid LFQ intensity per protein in each group. For example for a random protein 2 samples in control group and 1 sample in treated group have valid values. There are sometime more or less. There are cases also that per a specific protein, only one random sample from each group have valid value. And I am looking for differentially expressed proteins between control and treated. I don’t want to loose any of data. Could you please tell me what statistical method should I use for my analysis? How to transform and impute the data?

Proteomics analysis • 775 views

ADD COMMENT • link updated 14 days ago by Gordon Smyth ★ 8.2k • written 26 days ago by Nasim Gandomdoust • 0

score 2 · Accepted Answer · 2025-05-28

2

Entering edit mode

26 days ago

Gordon Smyth ★ 8.2k

One possibility in the limpa package, described here: https://www.biorxiv.org/content/10.1101/2025.04.28.651125v1

Ideally, you would give peptide-level data to limpa. If all you have are LFQ values, then you could try this:

library(limpa)
y.imputed <- dpcImpute(y)
design <- model.matrix(~Treatment)
fit <- dpcDE(y.imputed, design)
fit <- eBayes(fit, robust=TRUE)
topTable(fit, coef=2)

Here, y is a matrix or EList of normalized LFQ values on the log-scale, including NAs, and Treatment is a factor indicating which samples are treated and which are controls. y.imputed is an EList containing imputed values and also information about how reliable each imputed value is.

This approach will avoid losing any data and will, where possible, infer DE from the missing value pattern. A protein may come as DE even if it is entirely NA in one group, provided the LFQ values are sufficiently high in the other group.

The above code uses a default detection probability curve (DPC), but you should explore that by typing:

dpcfit <- dpc(y)
plotDPC(dpcfit)

ADD COMMENT • link 26 days ago by Gordon Smyth ★ 8.2k

1

Entering edit mode

For tl;dr, it's the limpa package https://bioconductor.org/packages/release/bioc/html/limpa.html

ADD REPLY • link 26 days ago by ATpoint 88k

0

Entering edit mode

Thank you for your reply. I would like to know if the package can handle batch effects and outlier samples too.

ADD REPLY • link 25 days ago by Nasim Gandomdoust • 0

1

Entering edit mode

limpa shares the full capabilities of the limma package, which includes adjustment for batch effects and outliers. To adjust for batch effects, include the relevant factors or covariates in the design matrix, as you would do for limma or edgeR. To detect and downweight outlier samples, include the argument sample.weights=TRUE in the call to dpcDE.

ADD REPLY • link 24 days ago by Gordon Smyth ★ 8.2k

0

Entering edit mode

Thank you for your answer. I conducted an analysis using limpa and found it quite interesting. As far as I understand, both ON and CN models are constructed per peptide (or per precursor) across multiple samples (please correct me if I’m wrong). Now, I would like to know whether the model is sensitive to group-specific parameters as well. For example, if a missing value belongs to the 'control' group, is the imputation based only on values from control samples, or from both control and treatment samples? Additionally, how would this apply in the context of single-cell proteomics data analysis? I would appreciate it if you could take a look at my code below and let me know whether I’m running it correctly (F023 in my genotype name).

lfq_data[lfq_data == 0] <- NA
lfq_data <- log2(lfq_data)
sample_names <- colnames(lfq_data)
F023_cols <- sample_names[grepl("F023_C|F023_T", sample_names)]
lfq_data_F023 <- lfq_data[, F023_cols]
F023_Treatment <- ifelse(grepl("F023_C", F023_cols), "LP",
                        ifelse(grepl("F023_T", F023_cols), "HP", NA))
F023_Treatment <- factor(F023_Treatment)
matched_indices <- !is.na(F023_Treatment)
lfq_data_F023 <- lfq_data_F023[, matched_indices]
F023_Treatment <- F023_Treatment[matched_indices]
F023_cols <- F023_cols[matched_indices]
if (any(is.na(F023_Treatment))) {
  stop("Unmatched samples remain after filtering!")
}
F023_Treatment <- relevel(F023_Treatment, ref = "HP")
design <- model.matrix(~ F023_Treatment) 
if (ncol(lfq_data_F023) != nrow(design)) {
  stop("Design matrix and data column count do not match!")
}
y <- new("EList", list(E = as.matrix(lfq_data_F023)))
y.imputed <- dpcImpute(y)
fit <- dpcDE(y.imputed, design, sample.weights = TRUE)
fit <- eBayes(fit, robust = TRUE)
results <- topTable(fit, coef = 2, adjust = "fdr", number = Inf)
dpcfit <- dpc(y)
plotDPC(dpcfit)

ADD REPLY • link 15 days ago by Nasim Gandomdoust • 0

2

Entering edit mode

Your code looks ok (although I can't check the data manipulations that are specific to your own dataset). The DE analysis you have done is based entirely on a CN (complete normal) model. The dpc() function uses an ON (observed normal) model, but you have not yet input the DPC so estimated into your DE analysis. You would do that, if you want, by specifying dpc=dpcfit in the dpcImpute() call.

limpa does not use group information at all at the imputation step, as you can see from the dpcImpute() function and help page. Using group information at this stage would cause "double dipping" and potentially lead to failure to control the FDR rate correctly later on in the DE analysis.

Our published papers and bioRxiv preprint include analyses of single cell proteomics data. See the package documentation.

ADD REPLY • link 14 days ago by Gordon Smyth ★ 8.2k