Question

Statistical Analysis Of Proteomic Spectral Count Data

12.6 years ago

Julian ▴ 200

How do people process spectral count data from protein mass spectrometry / proteomics experiments? One difficulty is the number of zeros in the data, most of which, because of the way the data is produced, can't reliably be treated as true zeros. A standard t-test / ANOVA doesn't seem appropriate because the data isn't normally distributed; it appears to follow a chi-squared distribution.

I have spectral count data for each protein in samples from a control vs. test experiment, with 3 biological replicates for each condition. The data has been put through Mascot and Scaffold. Ideally, I am looking for a p-value for each protein, not just one across the samples. I want some measure of reliability for the proteins being selected for review (something over and above what the TPP / Scaffold supply).

I've been pointed to the G-test and/or the chi-squared test. It's also been suggested that I use Fisher's exact test. I've tried these in R, without much success. I've also tried calculating a G-test value in an Excel spreadsheet, and I've used PepC. One particular issue I am having is generating the statistics on a per-protein basis as opposed to a per-sample basis.

I've also looked at some of the microarray processing techniques, but it appears to me that the large number of zeros in proteomic spectral-count data rules these approaches out.

Any experience and/or advice would be greatly appreciated. Thanks.

proteomics statistics statistics • 9.5k views

score 0 · Answer 1 · 2011-10-04

12.6 years ago

Alastair Kerr 5.3k

We collaborate with Laurence Florens and use her dNSAF (distributed normalized spectral abundance factor) method in our papers.

This paper (Zhang et al: Anal. Chem., 2010, 82 (6), pp 2272–2281) has her methodology and reasoning.
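As a rough illustration of the normalization idea, here is a minimal R sketch of plain NSAF, the quantity that dNSAF builds on: each spectral count is divided by protein length (SAF), then by the sum of SAF values in the same run. It does not implement the dNSAF step of distributing counts from peptides shared between proteins; see the paper for that. The counts, protein lengths and object names are made up for the example.

```r
# Hypothetical spectral counts (rows = proteins, columns = runs) and protein
# lengths in amino acids; none of these numbers come from the thread.
counts <- matrix(c(12, 0, 30,
                   15, 2, 28),
                 nrow = 3,
                 dimnames = list(c("P1", "P2", "P3"), c("run1", "run2")))
prot_len <- c(P1 = 250, P2 = 480, P3 = 1200)

# SAF: spectral count divided by protein length (recycled down each column).
saf <- counts / prot_len[rownames(counts)]

# NSAF: SAF divided by the per-run sum of SAF, so every column sums to 1.
nsaf <- sweep(saf, 2, colSums(saf), "/")
print(round(nsaf, 4))
```

Because each column of `nsaf` sums to 1, the values are comparable between runs with different total spectral counts.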

Ram · Answer 2 · 2011-10-04

Two more methods of spectral count based quantification that I know about:

The normalized spectral index method:

http://www.nature.com/nbt/journal/v28/n1/full/nbt.1592.html

and the 'Spectral Index' based method: "Spectral Index for Assessment of Differential Protein Expression in Shotgun Proteomics"

http://pubs.acs.org/doi/abs/10.1021/pr070271+

Check the Supplementary Data of both papers for code etc.

score 0 · Answer 3 · 2012-02-24

12.2 years ago

C Shao ▴ 140

In my experience, peptides with fewer than 2 hits should be discarded. We then used NSAF-PLGEM and QSpec on the cleaned data to find differentially expressed genes. NSAF-PLGEM seems to find more.
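The filtering step could look something like the R sketch below. It assumes that "hits" means a peptide's total spectral count across all runs, which is only one reading of the rule; the peptide names, counts and the `min_hits` threshold variable are hypothetical.

```r
# Hypothetical peptide-level spectral counts (rows = peptides, columns = runs).
peptide_counts <- matrix(c(0, 1, 0,
                           4, 6, 3,
                           1, 1, 0),
                         nrow = 3, byrow = TRUE,
                         dimnames = list(c("pepA", "pepB", "pepC"),
                                         c("run1", "run2", "run3")))

# Assumption: "hits" = total spectral count of a peptide over all runs.
min_hits <- 2
kept <- peptide_counts[rowSums(peptide_counts) >= min_hits, , drop = FALSE]
print(kept)
```

Here `drop = FALSE` keeps the result a matrix even if only one peptide passes the filter.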

score 0 · Answer 4 · 2012-02-24

I agree with the answers above; they are very good suggestions for improving on basic spectral counting. Regarding your question about using the Fisher test at the protein level on spectral counts, here is an example in R that applies the test to estimate differential expression in a Treated vs. Control sample pair. Take a simple case with 810 confident peptide-spectrum matches in the Treated sample (110 mapping exclusively to protein A and 700 mapping to other proteins) and 1050 peptide-spectrum matches in the Control sample (50 mapping only to A and 1000 mapping to other proteins):

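> spectral.counts <- matrix(c(110, 700, 50, 1000), nrow = 2,
+     dimnames = list(c("Protein A", "Not Protein A"), c("Treated", "Control")))
> spectral.counts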
              Treated Control
Protein A         110      50
Not Protein A     700    1000
> ft <- fisher.test(spectral.counts)
> ft

        Fisher's Exact Test for Count Data

data:  spectral.counts 
p-value = 2.03e-11
alternative hypothesis: true odds ratio is not equal to 1 
95 percent confidence interval:
 2.195404 4.544884 
sample estimates:
odds ratio 
  3.140917
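To get one p-value per protein rather than one per sample, the same 2x2 test can be repeated for every protein and the p-values adjusted for multiple testing. The R sketch below is one way to do that, under some assumptions: spectral counts are pooled over replicates (which ignores replicate-to-replicate variability), only "Protein A" reuses the numbers from the example above, "Protein B" and "Protein C" are invented, and Benjamini-Hochberg is my choice of adjustment rather than something specified in this answer.

```r
# Hypothetical pooled spectral counts; only "Protein A" comes from the
# example above, the other rows are made up for illustration.
counts <- data.frame(
  treated = c(110, 40, 5),
  control = c(50, 55, 0),
  row.names = c("Protein A", "Protein B", "Protein C")
)
# Total confident PSMs per sample (810 and 1050, as in the example above).
totals <- c(treated = 810, control = 1050)

# One 2x2 table per protein: this protein vs. all other proteins.
p_values <- sapply(rownames(counts), function(protein) {
  this <- unlist(counts[protein, c("treated", "control")])
  tab <- rbind(this, totals - this)
  fisher.test(tab)$p.value
})

# Benjamini-Hochberg adjustment across all tested proteins (an assumption).
q_values <- p.adjust(p_values, method = "BH")
print(data.frame(p_value = p_values, q_value = q_values))
```

Proteins with zero counts in one condition, such as the hypothetical "Protein C", still get a p-value, but with so few spectra the test has little power.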

If you are instead looking to compare the relative abundance of different proteins within a single sample, most bottom-up spectral-count MS datasets do not allow that to be done accurately, because proteins differ intrinsically in how detectable they are.