Question

CPTAC, TCGA, RNA-seq, validation

0

21 months ago

Rob ▴ 170

Hi friends, I hope you are all doing well.

I want to validate my TCGA analysis with CPTAC transcriptomic data. I don't know why in the validation, my model classifies all patients as one phenotype (I expected two: with and without a feature).

Do you have any idea why this is happening? Also, do you know what form the CPTAC data is in? Is it z-scored? I transformed my TCGA data to z-scores to be consistent with the CPTAC data.
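One rough way to check whether a downloaded expression table is already z-scored per gene is to look at each gene's mean and standard deviation across samples: values near 0 and 1 suggest a per-gene z-score. The sketch below is only an illustration in plain Python. The function names, the `tol` threshold and the toy gene names are assumptions, and it does not describe how any particular CPTAC file was actually produced.

```python
import statistics


def looks_z_scored(values_by_gene, tol=0.05):
    """Return True if every gene has mean ~0 and SD ~1 across samples.

    values_by_gene: {gene_name: [value_per_sample, ...]} (hypothetical layout).
    """
    for values in values_by_gene.values():
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)  # population SD across samples
        if abs(mean) > tol or abs(sd - 1.0) > tol:
            return False
    return True


def z_score_per_gene(values_by_gene):
    """Z-score each gene across the samples of ONE cohort."""
    scaled = {}
    for gene, values in values_by_gene.items():
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)
        if sd == 0:
            # A constant gene carries no information; map it to zeros.
            scaled[gene] = [0.0 for _ in values]
        else:
            scaled[gene] = [(v - mean) / sd for v in values]
    return scaled


# Toy data, not real expression values.
raw = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [10.0, 10.0, 30.0, 30.0]}
z = z_score_per_gene(raw)
```

Running `looks_z_scored` on both the TCGA matrix and the CPTAC matrix before applying the classifier would show whether the two inputs are at least on a comparable scale.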

CPTAC TCGA RNA-seq • 1.5k views

ADD COMMENT • link updated 21 months ago by Ernest Bonat ▴ 10 • written 21 months ago by Rob ▴ 170

0

Hello,

Could you explain more about which TCGA data you are using? Feel free to post 3-5 rows and indicate the features and label(s), so we can see the possible binary classification machine learning project.

ADD REPLY • link 21 months ago by Ernest Bonat ▴ 10

0

Thanks @Ernest for responding. I used the HTSeq raw count data from TCGA. I normalized it and then calculated z-scores. Then I built models and applied the best model (classifier) to the CPTAC data to validate my work.

This is my TCGA data after normalization and converting to Z-score:

ADD REPLY • link 21 months ago by Rob ▴ 170

0

Thanks @Rob. I understand that data scaling (normalization) is the next step after the data split in a machine learning project workflow, but why is there a need to calculate the z-score? Can you share the link where you downloaded the HTSeq raw count data of TCGA? Feel free to read the following blog post: Apply Machine Learning Algorithms for Genomics Data Classification.

ADD REPLY • link 21 months ago by Ernest Bonat ▴ 10

score 1 · Answer 1 · 2022-07-26

1

21 months ago

i.sudbery 19k

My understanding is that CPTAC is a proteomics project, and therefore the measurements will be proteomics data, whereas the TCGA data is RNA-seq (among other things), and therefore transcriptomics. I think it is not surprising that when you apply proteomics data to a model trained on transcriptomics, it doesn't work.

Transcript level is not perfectly correlated with protein level (far from it in some cases). In addition, RNA-seq data will quantify many genes that are not in the proteomics data (such as non-coding RNAs, or different splice isoforms, which may produce the same or different peptides). Each technique is also subject to different biases.

ADD COMMENT • link 21 months ago by i.sudbery 19k

0

My bad, there appears to be transcriptome data in CPTAC as well.

ADD REPLY • link 21 months ago by i.sudbery 19k

0

Yes, you made good points. I saw that some mRNA download sites include a file with a normalized z-score dataset too. I would like to know whether this is best practice.

ADD REPLY • link 21 months ago by Ernest Bonat ▴ 10

0

No, best practice would simply be to provide the raw, unchanged counts, because normalization of RNA-seq data is trivially simple starting from these counts via packages like edgeR or DESeq2 (it is really just a one-liner); the same goes for standardization or any simple transformation like log2. Providing these transformed values for download, often without any code, is just an annoying black box (all imo).
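To illustrate how little code a basic transformation from raw counts needs, here is a minimal plain-Python sketch of library-size normalization to counts per million (CPM) followed by log2. This is a deliberately simple stand-in, not the TMM normalization of edgeR or the median-of-ratios method of DESeq2. The function name `log2_cpm`, the input layout and the pseudocount of 1 are assumptions chosen for the example.

```python
import math


def log2_cpm(counts_by_sample, pseudocount=1.0):
    """Convert raw counts to log2(CPM + pseudocount), sample by sample.

    counts_by_sample: {sample_id: {gene_name: raw_count}} (hypothetical layout).
    """
    result = {}
    for sample, counts in counts_by_sample.items():
        library_size = sum(counts.values())
        if library_size == 0:
            raise ValueError(f"sample {sample!r} has no reads")
        result[sample] = {
            gene: math.log2(count / library_size * 1e6 + pseudocount)
            for gene, count in counts.items()
        }
    return result


# Toy counts, not real data.
toy_counts = {
    "s1": {"GENE_A": 0, "GENE_B": 1_000_000},
    "s2": {"GENE_A": 500_000, "GENE_B": 500_000},
}
toy_log_cpm = log2_cpm(toy_counts)
```

Because the raw counts stay available, the same starting point can be re-normalized with a different method later, which is not possible when only transformed values are distributed.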

ADD REPLY • link 21 months ago by ATpoint 82k

0

Sure, but you will need to scale the X features in machine learning before fitting the models anyway...
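As a sketch of what "scaling before fitting" usually means in a train/validation setting: the scaling parameters (per-feature mean and SD) are estimated on the training samples only and then reused unchanged on the validation samples. The helper names `fit_scaler` and `apply_scaler` and the toy numbers are assumptions for illustration, in plain Python rather than any specific machine learning library.

```python
import statistics


def fit_scaler(train_rows):
    """Estimate per-feature mean and SD from training rows (samples x features)."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid division by zero.
    sds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, sds


def apply_scaler(rows, means, sds):
    """Scale rows with parameters that were fitted elsewhere (e.g. on training data)."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


# Toy feature matrices, not real expression data.
train = [[1.0, 10.0], [3.0, 30.0]]
validation = [[5.0, 20.0]]

means, sds = fit_scaler(train)
train_scaled = apply_scaler(train, means, sds)
validation_scaled = apply_scaler(validation, means, sds)
```

If the validation cohort was measured or preprocessed on a different scale, the training-derived parameters will leave its values shifted, which is one possible way a classifier ends up assigning every validation sample to the same class.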

ADD REPLY • link 21 months ago by Ernest Bonat ▴ 10