Hello,
I did a quantitative proteomics experiment to measure the differential expression of proteins in cells between two conditions. The output is a list of peptides, the protein each one maps to, and their abundance in the experimental and control conditions. Each protein has several detected peptides, and I need to pull out the median peptide abundance per protein, per condition, into a new data frame. A simplified version is shown below:
| protein | peptide | condition 1 abundance | condition 2 abundance |
| --------- | -------- | --------------------- | --------------------- |
| protein 1 | A APGSR | 1 | 4 |
| protein 1 | ASTGR | 2 | 5 |
| protein 2 | ASTTGAR | 3 | 6 |
| protein 2 | PAGPAPTR | 3.5 | 7 |
| protein 2 | VPSTR | | 5 |
Is there a way to write code for this in R? Note that I have about 6000 proteins, and about 60,000 detected peptides. Not all peptides were detected in both condition 1 and 2, but I would still need to take the median of all peptides per protein for each condition separately.
The goal is to run a statistical analysis on the median peptide abundance for each protein, so I can see whether the values differ significantly between the two conditions.
Thanks in advance!
Please be sure to use specialized software for this. In Bioconductor there are, e.g., the `DEqMS` or `DEP` packages, which provide sound statistical frameworks. Don't start by putting these medians into custom tests such as t-tests or anything like that. If you really need the medians, I suggest you look at, e.g., `dplyr` tutorials on how to compute medians per group. Hint: it will come down to using `group_by` to group the data.frame by protein, and then running `median` on each condition's abundance column. Try something please, these kinds of coding skills are essential for a bioinformatician. Happy to help once you get stuck.
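For reference, here is a minimal sketch of the grouping step described in the hint. It assumes a data frame shaped like the example table in the question; the names `peptides`, `cond1`, `cond2`, and `protein_medians` are made up for illustration and are not from the original data. Missing abundances are assumed to be stored as `NA`, so each condition's median only uses the peptides detected in that condition.

```r
library(dplyr)

# Toy data mirroring the table in the question.
# Column names (cond1, cond2) are assumptions for this sketch.
peptides <- data.frame(
  protein = c("protein 1", "protein 1", "protein 2", "protein 2", "protein 2"),
  peptide = c("A APGSR", "ASTGR", "ASTTGAR", "PAGPAPTR", "VPSTR"),
  cond1   = c(1, 2, 3, 3.5, NA),  # NA = peptide not detected in condition 1
  cond2   = c(4, 5, 6, 7, 5)
)

# One row per protein, with the median peptide abundance per condition.
# na.rm = TRUE drops undetected peptides within each condition separately.
protein_medians <- peptides %>%
  group_by(protein) %>%
  summarise(
    cond1_median = median(cond1, na.rm = TRUE),
    cond2_median = median(cond2, na.rm = TRUE),
    n_peptides   = n(),
    .groups = "drop"
  )

print(protein_medians)
```

On the toy data this gives 1.5 and 4.5 for protein 1, and 3.25 and 6 for protein 2. A protein with no detected peptides in a condition ends up with `NA` for that median. This only covers the summarising step; for the actual significance testing, the dedicated packages mentioned above are still the better route.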