Question: Statistical Analysis Of Proteomic Spectral Count Data
7.5 years ago by
Julian200
Manchester, UK
Julian200 wrote:

How do people process spectral count data from protein mass spectrometry / proteomics experiments? The data contain a large number of zeros, most of which, because of the way the data are produced, cannot reliably be treated as true zeros. A standard t-test / ANOVA doesn't seem appropriate because the data aren't normally distributed; they appear to follow a Chi-squared distribution.

I have spectral count data for each protein in samples from an experiment of a control v. test. There are 3 biological replicates for each. The data has been put through Mascot and Scaffold. I am looking, in an ideal world, for a p-value for each protein - not just across the samples. I am looking to produce some reliability on the proteins being selected for review (something over and above what the TPP / Scaffold are supplying).

I've been pointed to using the g-test and/or the Chi-squared test. It's also been suggested to use the Fisher test. I've tried these in R, without loads of success. I've tried using a spreadsheet in Excel to calculate a g-test value too. I've also used PepC. One of the particular issues I am having is generating the statistics on a protein basis as opposed to the sample basis.

I've also looked at some of the microarray processing techniques, but it appears to me that the large number of zeros in proteomic spectral-count data rules these approaches out.

Any experience and/or advice would be greatly appreciated. Thanks.

proteomics statistics • 7.0k views
7.5 years ago by
Alastair Kerr5.2k
The University of Edinburgh, UK
Alastair Kerr5.2k wrote:

We collaborate with Laurence Florens and use her dNSAF (distributed normalized spectral abundance factor) method in our papers.

This paper (Zhang et al: Anal. Chem., 2010, 82 (6), pp 2272–2281) has her methodology and reasoning.
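
As a rough illustration of the idea behind SAF-style normalization, here is a minimal R sketch of the plain NSAF calculation: each spectral count is divided by the protein's length, then rescaled so the values in each sample sum to 1. Note that this is the basic NSAF, not the dNSAF refinement from the paper (which additionally distributes spectra from shared peptides among proteins). The counts, protein names and lengths below are made up for the example.

```r
# Hypothetical spectral counts (rows: proteins, columns: samples).
counts <- matrix(c(110, 50,
                    20,  0,
                     5,  8),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("A", "B", "C"), c("Treated", "Control")))

# Hypothetical protein lengths in amino acids, in the same row order.
prot_lengths <- c(A = 500, B = 250, C = 100)

# SAF: spectral count divided by protein length.
# A length-3 vector recycles down each column, so row i is divided by length i.
saf <- counts / prot_lengths

# NSAF: divide each SAF by the sum of SAFs in the same sample (column).
nsaf <- sweep(saf, 2, colSums(saf), "/")

print(round(nsaf, 4))
```

Each column of `nsaf` sums to 1, so the values can be compared between samples with different total numbers of spectra.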

7.5 years ago by
Woa2.7k
United States
Woa2.7k wrote:

Two more methods of spectral count based quantification that I know about:

The normalized spectral index method:

http://www.nature.com/nbt/journal/v28/n1/full/nbt.1592.html

and the 'Spectral Index' based method: "Spectral Index for Assessment of Differential Protein Expression in Shotgun Proteomics"

http://pubs.acs.org/doi/abs/10.1021/pr070271+

Check the Supplementary Data of both papers for code etc.

7.2 years ago by
C Shao130
C Shao130 wrote:

In my experience, peptides with fewer than 2 hits should be discarded. We then used NSAF-PLGEM and QSpec on the cleaned data to find the differentially expressed genes. NSAF-PLGEM seems to find more.

7.2 years ago by
Ahill1.5k
United States
Ahill1.5k wrote:

I agree with the answers above; these are very good suggestions for improving on basic spectral counting. Regarding your question about using the Fisher test at the protein level on spectral counts, here is an example in R that applies the test to estimate differential expression in a Treated vs. Control sample pair. Take a simple case with 810 confident peptide-spectrum matches in the Treated sample (110 mapping exclusively to protein A and 700 mapping to other proteins) and 1050 peptide-spectrum matches in the Control sample (50 mapping only to A and 1000 mapping to other proteins):

> spectral.counts
              Treated Control
Protein A         110      50
Not Protein A     700    1000
> ft <- fisher.test(spectral.counts)
> ft

        Fisher's Exact Test for Count Data

data:  spectral.counts 
p-value = 2.03e-11
alternative hypothesis: true odds ratio is not equal to 1 
95 percent confidence interval:
 2.195404 4.544884 
sample estimates:
odds ratio 
  3.140917
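
To get one p-value per protein, the same 2x2 test can be repeated for every protein against "all other proteins" and the p-values then adjusted for multiple testing. The following is only a sketch under some assumptions: the counts are hypothetical (protein A and the column totals are chosen to match the table above), they are assumed to be summed over the replicates of each condition (which ignores replicate-to-replicate variability), and `fisher_per_protein` is a name made up for this example.

```r
# Hypothetical pooled spectral counts (rows: proteins, columns: conditions).
# Column totals are 810 (Treated) and 1050 (Control), as in the example above.
counts <- matrix(c(110,  50,
                   300, 420,
                   400, 580),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("A", "B", "C"), c("Treated", "Control")))

# Run Fisher's exact test for each protein versus all other proteins,
# then apply the Benjamini-Hochberg correction to the p-values.
fisher_per_protein <- function(counts) {
  totals <- colSums(counts)
  p <- apply(counts, 1, function(x) {
    tab <- rbind(protein = x, other = totals - x)  # 2x2 table for this protein
    fisher.test(tab)$p.value
  })
  data.frame(protein    = rownames(counts),
             p.value    = p,
             p.adjusted = p.adjust(p, method = "BH"),
             row.names  = NULL)
}

result <- fisher_per_protein(counts)
print(result)
```

With real data, `counts` would hold one row per identified protein. The BH-adjusted column is usually the safer one to threshold on when many proteins are tested at once.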

If you want to compare the relative abundance of different proteins within a single sample, that is not possible to do accurately with most bottom-up spectral-count MS datasets, because proteins differ intrinsically in how detectable they are.
