Questions on analyzing TCGA proteome profiling data.
15 months ago
walden • 0

Sorry for any mistakes; English is not my native language.

I am trying to create a machine learning model that takes TCGA proteome profiling data as its input.

I hope someone can help me understand the meaning of the columns and values in the downloaded TCGA proteome matrices.

I downloaded the TCGA proteome matrices with the following options (using the TCGAbiolinks package in R):

library(TCGAbiolinks)
query_protein <- GDCquery(
  project = "TCGA-BRCA",
  data.category = "Proteome Profiling",
  data.type = "Protein Expression Quantification",
  experimental.strategy = "Reverse Phase Protein Array"
)
GDCdownload(query_protein)

Then I opened one of the downloaded TSV files.


My questions are about the "protein_expression" column.

  1. Why are there negative values in this column?
  2. Does that mean some kind of normalization/standardization was applied?
  3. If so, is there a detailed explanation of the methods used anywhere online?
  4. If not, are there any R packages that can normalize/standardize this data? The data should be normalized before I use it as input for my machine learning model.

There are a lot of useful tips and resources about TCGA transcriptome data on the internet. However, I was unable to find good tips about analyzing TCGA proteome data. Your help would be greatly appreciated!

proteome R TCGA • 783 views
14 months ago
pilargmarch ▴ 110

You should always check the GDC documentation, as it explains the pipelines used to process the data. From the references in the GDC Docs' Protein Expression entry:

The slides were scanned, analyzed, and quantified using Array-Pro Analyzer software (MediaCybernetics) to generate spot intensity (Level 1 data). SuperCurve GUI (2), was used to estimate relative protein levels (in log2 scale). A fitted curve ("Supercurve") was created with signal intensities on the Y-axis and relative log2 amounts of each protein on the X-axis using a non-parametric, monotone increasing B-spline model (1). Raw spot intensity data were adjusted to correct spatial bias before model fitting using “control spots” arrayed across the slides (3). A QC metric (4) was generated for each slide to determine slide quality and only slides greater than 0.8 on a 0-1 scale were included for further processing. For replicate slides, the slide with the highest QC score was used for analysis (Level 2 data). Protein measurements were corrected for loading as described (2,5) using bidirectional median centering across samples and antibodies (Level 3 data).

Based on Level 2 data, the data normalization is processed as follows:

  1. Calculate the median for each protein across all the samples.
  2. Subtract the median (from step 1) from values within each protein.
  3. Calculate the median for each sample across all proteins.
  4. Subtract the median (from step 3) from values within each sample.

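So yes, the data are log2-transformed and median-centered, which is why there are both negative and positive protein expression values. A negative value means the relative protein level is below the median, not that the abundance itself is negative.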
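The four normalization steps above can be sketched in base R. This is only an illustration on a made-up toy matrix (`level2`, `median_center`, and the protein/sample names are hypothetical, not part of any GDC tool), and it assumes the sample medians in step 3 are computed on the values that were already centered per protein. The Level 3 files you downloaded have already been through this kind of centering, so you would not need to repeat it.

```r
# Toy Level 2-style matrix of log2 relative protein levels (made-up values).
# Rows are proteins (antibodies), columns are samples.
level2 <- matrix(
  c(1.0, 2.0, 3.0,
    4.0, 6.0, 8.0,
    0.5, 0.5, 2.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("protA", "protB", "protC"), c("s1", "s2", "s3"))
)

median_center <- function(x) {
  # Steps 1-2: subtract each protein's median across samples (row medians).
  x <- sweep(x, 1, apply(x, 1, median, na.rm = TRUE), "-")
  # Steps 3-4: subtract each sample's median across proteins (column medians),
  # computed on the protein-centered values (an assumption about step order).
  x <- sweep(x, 2, apply(x, 2, median, na.rm = TRUE), "-")
  x
}

level3 <- median_center(level2)
print(level3)
```

After centering, every sample (column) has a median of zero, so roughly half of the values in each sample are negative by construction.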
