I have mass-spectrometry-based proteome data from 6 control and 3 treated samples. The number of valid LFQ intensities per protein varies within each group. For example, a given protein may have valid values in 2 control samples and 1 treated sample, sometimes more and sometimes fewer. There are also cases where, for a specific protein, only one sample from each group has a valid value. I am looking for differentially expressed proteins between control and treated, and I don't want to lose any data. Could you please tell me which statistical method I should use for my analysis, and how to transform and impute the data?
tl;dr: use the limpa package, https://bioconductor.org/packages/release/bioc/html/limpa.html
Thank you for your reply. I would like to know if the package can handle batch effects and outlier samples too.
limpa shares the full capabilities of the limma package, which include adjustment for batch effects and outliers. To adjust for batch effects, include the relevant factors or covariates in the design matrix, as you would do for limma or edgeR. To detect and downweight outlier samples, include the argument sample.weights=TRUE in the call to dpcDE().
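As a rough illustration of the design-matrix advice above, here is a minimal sketch in base R. The batch labels and the object names `targets` and `y.protein` are assumptions made up for the example, not taken from the poster's data; only the `sample.weights=TRUE` argument to `dpcDE()` comes from the answer itself.

```r
# Hypothetical sample annotation: 6 control and 3 treated samples.
# The batch assignment below is invented purely for illustration.
targets <- data.frame(
  group = factor(c(rep("control", 6), rep("treated", 3)),
                 levels = c("control", "treated")),
  batch = factor(c("A", "B", "A", "B", "A", "B", "A", "B", "A"))
)

# Additive model: the batch term absorbs batch differences, so the
# "grouptreated" coefficient is the treated-vs-control effect adjusted for batch.
design <- model.matrix(~ batch + group, data = targets)
colnames(design)

# With limpa loaded and a protein-level object (here called y.protein, a
# hypothetical name), the design would then be passed on as described above:
# fit <- dpcDE(y.protein, design, sample.weights = TRUE)
```

Putting the batch term first keeps the group effect as the last coefficient, which is convenient when selecting it later by name.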
Thank you for your answer. I conducted an analysis using limpa and found it quite interesting. As far as I understand, both ON and CN models are constructed per peptide (or per precursor) across multiple samples (please correct me if I’m wrong). Now, I would like to know whether the model is sensitive to group-specific parameters as well. For example, if a missing value belongs to the 'control' group, is the imputation based only on values from control samples, or from both control and treatment samples? Additionally, how would this apply in the context of single-cell proteomics data analysis? I would appreciate it if you could take a look at my code below and let me know whether I’m running it correctly (F023 in my genotype name).
Your code looks ok (although I can't check the data manipulations that are specific to your own dataset). The DE analysis you have done is based entirely on a CN (complete normal) model. The dpc() function uses an ON (observed normal) model, but you have not yet passed the DPC estimated that way into your DE analysis. You would do that, if you want, by specifying dpc=dpcfit in the dpcImpute() call.
limpa does not use group information at all at the imputation step, as you can see from the dpcImpute() function and help page. Using group information at this stage would cause "double dipping" and potentially lead to failure to control the FDR correctly later on in the DE analysis.
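The suggestion to pass the estimated DPC into the imputation step could look roughly like the sketch below. It assumes limpa is installed and that `y` is the same precursor-level object already used in the poster's code; the wrapper name `impute_with_dpc` is hypothetical, and only `dpc()`, `dpcImpute()`, and the `dpc=` argument are taken from the answer above.

```r
# Hypothetical wrapper: estimate the detection probability curve (DPC)
# and hand it to the imputation step instead of leaving `dpc` unset.
impute_with_dpc <- function(y) {
  dpcfit <- limpa::dpc(y)            # DPC estimated under the ON model
  limpa::dpcImpute(y, dpc = dpcfit)  # imputation using the estimated DPC
}

# Usage, assuming `y` holds the precursor-level log-intensities:
# y.imputed <- impute_with_dpc(y)
```

No group labels are passed to either call here, which matches the point that the imputation step does not use group information.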
Our published papers and bioRxiv preprint include analyses of single cell proteomics data. See the package documentation.