Pre-processing liquid chromatography-coupled mass spectrometry (LC-MS) spectral counts for downstream analysis (transforming and normalizing)
6.6 years ago
moldach ▴ 130

For label-free shotgun proteomics, relative quantification of proteins/peptides can be done through either spectral counting or intensity-based methods. I was given a list of raw spectral counts (SpCs) by a technician and have been tasked with the analysis (I am new to the proteomics field, coming from transcriptomics), and now I need to do some pre-processing prior to downstream analysis.

Common pre-processing tasks include log2-transformation to render the intensities more symmetric, and normalization to reduce systematic technical variation while retaining the underlying biological signal (Goeminne et al., 2017).
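As a rough illustration of those two steps, here is a minimal base-R sketch (toy data, not a real SpC table) that log2-transforms a small count matrix and quantile-normalizes it by hand:

```r
# Toy matrix of spectral counts: 4 proteins x 3 runs (made-up values).
set.seed(1)
spc <- matrix(rpois(12, lambda = 20), nrow = 4,
              dimnames = list(paste0("prot", 1:4), paste0("run", 1:3)))

# Add a pseudocount of 1 so zero counts do not map to -Inf under log2.
log_spc <- log2(spc + 1)

# Quantile normalization by hand: sort each column, average across columns
# at each rank, then put the averages back in each column's original order.
ranks    <- apply(log_spc, 2, rank, ties.method = "first")
sorted   <- apply(log_spc, 2, sort)
row_mean <- rowMeans(sorted)
norm_spc <- apply(ranks, 2, function(r) row_mean[r])
dimnames(norm_spc) <- dimnames(log_spc)

# After quantile normalization, every column has an identical distribution.
colMeans(norm_spc)
```

(Whether these two operations are appropriate for count data at all is exactly the question discussed further down in this thread.)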

I tried the MSqRob tool from Goeminne et al., 2017 to do the log2-transformation and quantile normalization; however, I ran into problems because the authors' vignette example loads their sample data (peptides.txt) with the system.file command, and it wasn't clear to me how to load my own data. [For example, I tried the read.table and read.csv commands on a copy of peptides.txt saved in my working directory, but when I call peptidesFranc <- read_MaxQuant(file_peptides_txt, pattern="Intensity ", remove_pattern=TRUE) I receive errors.] I contacted the authors last week but have not received any help, so I thought I'd ask here and circulate on Twitter.
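One sanity check while debugging this: peptides.txt is plain tab-delimited text, so it can be read with base R first to rule out a problem with the file itself. The two-row table below is a made-up stand-in for a real MaxQuant export:

```r
# Write a tiny fake peptides.txt-style table to a temp file (toy data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Sequence\tIntensity A\tIntensity B",
             "PEPTIDER\t1200\t980",
             "SAMPLEK\t300\t410"), tmp)

# MaxQuant column names contain spaces, so keep check.names = FALSE.
peptides <- read.delim(tmp, sep = "\t", header = TRUE, check.names = FALSE)
head(peptides)
```

If this base-R import succeeds on the real peptides.txt, the file is fine and the remaining read_MaxQuant() error is likely about the arguments the function expects (e.g., whether it accepts a bare file path, which is an assumption on my part).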

Another tool I found, Crux, supports four types of quantification (log2-transformation is applied before any one of these methods, I believe?): Normalized Spectral Abundance Factor (NSAF), Distributed Normalized Spectral Abundance Factor (dNSAF), Normalized Spectral Index (SI_N) and Exponentially Modified Protein Abundance Index (emPAI). Say I wanted to use NSAF, for example. NSAF is defined as follows: (NSAF)_k = (SpC/Length)_k / Σ_{i=1}^{N} (SpC/Length)_i, where "SpC" is the spectral count of protein k, "Length" is the length of that protein, and "N" is the total number of proteins.
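A minimal sketch of that formula on made-up numbers (toy SpCs and protein lengths, no real data assumed):

```r
# Toy spectral counts and protein lengths for three proteins.
spc <- c(protA = 12, protB = 45, protC = 3)
len <- c(protA = 250, protB = 1100, protC = 90)  # residues per protein

saf  <- spc / len        # spectral abundance factor, SpC / Length
nsaf <- saf / sum(saf)   # normalize so the factors sum to 1 per run

sum(nsaf)  # 1, by construction
```

This shows why NSAF normalizes within a run: dividing by protein length corrects for longer proteins yielding more spectra, and dividing by the column total puts runs on a common scale.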

You would think one could just load SpCs along with protein lengths into Crux and get NSAF; however, the input for Crux is a collection of scored peptide-spectrum matches (PSMs).

The help I need from my Biostars colleagues is a suggestion for a package I can use to transform and normalize raw spectral counts (here is a sample - not my data - of what spectral counts look like).

Perhaps I'm approaching this the wrong way. Maybe I DO need to input PSMs (which I believe come from MS2 files?) to get normalized SpCs. This is a file I was never given by the LC-MS technician, but maybe they didn't know I needed it?

I guess the third option is to write my own in-house script to do the log2-transformation and normalization, but I don't see the point in re-inventing the wheel - there HAS to be a package out there somewhere?

Your input is greatly appreciated! Thank you

proteomics normalization spectral counts • 2.2k views
6.6 years ago
prvst • 0

Hi,

I strongly suggest not using or relying on software/packages without support or active development; it can be a sign that the project is abandoned. I also don't suggest writing your own package right now, especially if you are new to the area. Proteomics data analysis has several particularities that take some time to understand.

What you need can be done with the MSstats R package. It comes from a very good group dedicated to mass spectrometry-based proteomics statistical analysis, the software is easy to use, they have video tutorials and lectures on YouTube (look for MayInstitute), and they have an active Google group for questions.

Welcome to proteomics, and good luck!


Thank you for your suggestion. It's often a "Goldilocks" challenge to find a software package that isn't too new (bleeding-edge stuff is often buggy or poorly documented) or too old - but just right ;)

I agree with you that writing my own package may be difficult seeing as I don't understand all of the (statistical) subtleties yet.

That being said, I have a couple of questions about the MSstats package that I'm hoping you could clarify.

Section 1.2 of the manual says that the first step "transforms, normalizes and summarizes the intensities of the peaks per MS run". This suggests the expected input is spectral intensities (e.g., from MaxQuant), not spectral counts. However, section 2.1.1 (J) Intensity says that "any other quantitative representation of abundance can also be used". So can you confirm that spectral counts (which typically range from 0 to 150) can indeed be used?

I got another response on Twitter from @olgavitek. She said I should NOT log-transform or (quantile-)normalize count data, as these operations are only defined for peak intensities and destroy the properties of counts. She also said that NSAF (and friends) are useful for comparing proteins, but not for comparing the abundance of the same protein between conditions. She referenced a paper whose details I would have loved to read, but it's pay-walled. Your thoughts on these statements?

One other thing I feel is pertinent to add, since I did not mention it in my original post: my goal is to display spectral counts in a heatmap. Raw spectral counts typically range from 0 to 200, but spectral intensities have a much larger range (into six figures). A colleague of mine said that by log-transforming spectral counts I may lose some of the granularity of the count data in my visualization (the color ramp might not be as obvious), since log2(150) ≈ 7 while log2(250,000) ≈ 18. But maybe I'm not thinking clearly, and by adjusting the bin ranges I could maintain the granularity of the color-ramp differences.
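A quick numeric check of that worry, with made-up counts: binning the colour scale on the log2-count range itself (rather than on an intensity-sized range) should keep the ramp informative for count data:

```r
# The compression my colleague described:
log2(150)      # ~7.2
log2(250000)   # ~17.9

# Toy spectral counts; bin their log2 values over the count range (0-200)
# rather than an intensity-sized range, so the colour bins stay meaningful.
counts <- c(0, 5, 20, 75, 150)
breaks <- seq(0, log2(200 + 1), length.out = 9)   # 8 colour bins
bins   <- cut(log2(counts + 1), breaks = breaks, include.lowest = TRUE)
table(bins)
```

(The break points here are an arbitrary choice for illustration; heatmap functions generally let you pass explicit breaks alongside the colour palette.)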
