I have a dataset which consists of around 300 plasma samples taken from patients diagnosed with breast cancer (of 5 different subtypes). There are 11 proteins which were measured using label-free LC-MS for those samples. My goal is to find proteins which are distributed differently between any pair of breast cancer subtypes.
I performed 2 different procedures in parallel.
Procedure 1: log-transform the data, then apply median transformation (from each protein value within a sample, the median of all protein values in that sample is subtracted). ComBat is then applied to remove batch effects (based on a PCA which shows clustering by plate). Finally, the Kruskal-Wallis test is applied. I get no groups which exhibit any difference in distributions.
Procedure 2: log-transform the data, then apply quantile transformation [using the normalize.quantiles function, where samples are columns and each row corresponds to a protein], then perform ComBat. Finally, the Kruskal-Wallis test is applied. I get 10 pairs of groups for which the difference is significant.
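To make the difference between the two normalisation steps concrete, here is a minimal Python sketch of per-sample median centring (Procedure 1) and quantile normalisation (Procedure 2). It is an illustration only: the toy matrix and the function names median_center and quantile_normalize are made up for this example, ties are not averaged, and it is not the implementation behind normalize.quantiles. ComBat and the Kruskal-Wallis test are left out.

```python
from statistics import mean, median

def median_center(matrix):
    """Subtract each sample's (column's) median from every value in that sample."""
    n_cols = len(matrix[0])
    col_medians = [median([row[j] for row in matrix]) for j in range(n_cols)]
    return [[row[j] - col_medians[j] for j in range(n_cols)] for row in matrix]

def quantile_normalize(matrix):
    """Give every sample (column) the same set of values: the mean of the
    sorted columns, assigned back by rank within each column.
    Simplification: tied values are not averaged."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    sorted_cols = [sorted(row[j] for row in matrix) for j in range(n_cols)]
    # Reference distribution: mean across samples at each rank.
    reference = [mean([col[i] for col in sorted_cols]) for i in range(n_rows)]
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        order = sorted(range(n_rows), key=lambda i: matrix[i][j])
        for rank, i in enumerate(order):
            out[i][j] = reference[rank]
    return out

# Hypothetical log2 intensities: rows are proteins, columns are samples.
toy = [
    [10.0, 12.0, 11.0],
    [14.0, 15.0, 13.0],
    [12.0, 18.0, 17.0],
]

centered = median_center(toy)
normalized = quantile_normalize(toy)
```

Median centring only shifts each sample by a constant, so the spread within a sample is preserved. Quantile normalisation forces every sample to contain exactly the same set of values, which is a much stronger assumption when only 11 proteins are measured per sample.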
How do I determine the correct normalisation method given such different results?
Many thanks in advance
Can you tell me how you started from the raw input? I have raw files which I have converted into mzML files. I have seen a few R workflows, but I'm not clear which values should be extracted for the differential analysis, as I also have two groups.
Can you tell me what workflow I should follow that takes the mzML files and produces input data that I can use in R for various EDA and differential analysis?
For this context I would say go ahead with edgeR
I work with files received from another team, which are in txt format. There are some packages in R which help to handle mzML files (e.g. https://lgatto.github.io/RforProteomics/articles/RforProteomics.html).
As for edgeR, I am not sure. Generally the workflow includes something like what is described here, for instance: https://www.embopress.org/doi/10.15252/msb.202110240
What are these values? Can you show me a sample input, so that I have an idea of what to extract from the mzML files?
1769mkc You can use > (or the " icon in the edit window) to quote parts of a post that you want to respond to with a comment.
"This workflow starts with a raw data matrix, for which initial steps such as peptide‐spectrum matching, quantification, and FDR control have been completed. Data are assumed to be log‐transformed unless the variance stabilizing transformation (Durbin et al, 2002) is used. In the latter case, the data transformation is included in the normalization procedure."
So how do you get the raw matrix? Is it like a gene count file? If yes, then you can use DESeq2 or edgeR, which can take it as input, and you can do the batch correction there itself by specifying th