Question

Questions on analyzing TCGA proteome profiling data.

15 months ago

walden • 0

Sorry for any mistakes; English is not my native language.

I am trying to create a machine learning model that takes TCGA proteome profiling data as its input.

I hope someone can help me understand the meaning of the columns and values in the downloaded TCGA proteome matrices.

I downloaded the TCGA proteome matrices with the following options (using the TCGAbiolinks package in R):

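library(TCGAbiolinks)

# Query RPPA protein expression data for the TCGA-BRCA project
query_protein <- GDCquery(
  project = "TCGA-BRCA",
  data.category = "Proteome Profiling",
  data.type = "Protein Expression Quantification",
  experimental.strategy = "Reverse Phase Protein Array"
)

# Download the files matched by the query from the GDC
GDCdownload(query_protein)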

Then I opened one of the downloaded TSV files.

The column in question is the "protein_expression" column.

Why are there negative values in this column?
Does it mean that some kind of normalization/standardization was applied?
If so, is there a detailed explanation of the method available online?
If not, are there any R packages that can normalize/standardize this data? The data should be normalized before I use it as input for my machine learning model.

There are a lot of useful tips and resources about TCGA transcriptome data on the internet. However, I was unable to find good tips about analyzing TCGA proteome data. Your help would be greatly appreciated!

proteome R TCGA • 783 views

ADD COMMENT • link updated 14 months ago by pilargmarch ▴ 110 • written 15 months ago by walden • 0

score 0 · Answer 1 · 2023-01-26

You should always check the GDC documentation page, as it explains all pipelines used to process the data. From the references in GDC Docs' Protein Expression Entry:

From MD Anderson Cancer Center on their RPPA protocol:

The slides were scanned, analyzed, and quantified using Array-Pro Analyzer software (MediaCybernetics) to generate spot intensity (Level 1 data). SuperCurve GUI (2), was used to estimate relative protein levels (in log2 scale). A fitted curve ("Supercurve") was created with signal intensities on the Y-axis and relative log2 amounts of each protein on the X-axis using a non-parametric, monotone increasing B-spline model (1). Raw spot intensity data were adjusted to correct spatial bias before model fitting using “control spots” arrayed across the slides (3). A QC metric (4) was generated for each slide to determine slide quality and only slides greater than 0.8 on a 0-1 scale were included for further processing. For replicate slides, the slide with the highest QC score was used for analysis (Level 2 data). Protein measurements were corrected for loading as described (2,5) using bidirectional median centering across samples and antibodies (Level 3 data).

Another page from MD Anderson explains the normalization in more detail:

Based on Level 2 data, the data normalization is processed as follows:

1. Calculate the median for each protein across all the samples.

2. Subtract the median (from step 1) from values within each protein.

3. Calculate the median for each sample across all proteins.

4. Subtract the median (from step 3) from values within each sample.
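To make these steps concrete, here is a minimal R sketch of bidirectional median centering on a toy matrix. It is only an illustration of the four steps as described, not MD Anderson's actual code: the matrix values, the `level2`/`level3` names, and the `median_center()` helper are made up for this example, and it assumes samples are in rows and proteins in columns.

```r
# Toy "Level 2" matrix on the log2 scale (values are invented for illustration).
# Rows = samples, columns = proteins.
level2 <- matrix(
  c(1, 2, 4,
    2, 5, 3,
    6, 3, 8),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("sample", 1:3), paste0("protein", 1:3))
)

# Hypothetical helper implementing the four steps described above.
median_center <- function(x) {
  # Steps 1-2: compute each protein's median across samples and subtract it.
  protein_medians <- apply(x, 2, median, na.rm = TRUE)
  x <- sweep(x, 2, protein_medians, "-")
  # Steps 3-4: compute each sample's median across proteins and subtract it.
  sample_medians <- apply(x, 1, median, na.rm = TRUE)
  sweep(x, 1, sample_medians, "-")
}

level3 <- median_center(level2)
print(level3)
```

Even though every input value is positive, the centered matrix contains negative entries: anything that falls below the relevant median ends up below zero.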

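So yes, the data is log2-transformed and normalized by median centering, which is why there are both negative and positive protein expression values: after centering, a negative value simply means the measurement falls below the median it was centered on.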