Question: TCGA - Correlation between gene expression and CNV
rin wrote:

Hi everyone

I am new here and to the bioinformatics world, and I would appreciate your help. I am currently looking into correlating gene expression and CNV data from TCGA, most probably for colorectal or ovarian cancer. After some data exploration, I found out that only a small percentage of samples are from normal tissues. That being said, should the identification of DEGs be done only between paired (tumor - normal) samples, even if the statistical power would be low? For correlating the above-mentioned data, which would be the more meaningful analysis: 1. a correlation between DEGs and amplified/deleted genes, or 2. a correlation between expression (all the expression data from tumor samples, without taking differential expression into account) and the CNV?

Thanks for helping!

Tags: cnv, rna-seq, correlation, tcga, R
written 9 months ago by rin
Kevin Blighe (Guy's Hospital, London) wrote:

Yes, the number of Tumour-Normal pairs in the TCGA RNA-seq data is low. Others have somewhat circumvented this issue by not doing any direct comparisons and instead answering the question: 'What is highly and lowly expressed in the tumour and normal samples separately?' This is how cBioPortal does it, and the default is Z-score > 2 for highly expressed and Z-score < -2 for lowly expressed. Z-scores should ideally be produced from the logged, normalised counts.

I would take this approach (above) and correlate the highly and lowly expressed genes to the CNVs.

Of course, any logical approach will be fine.

Kevin

written 9 months ago by Kevin Blighe

Thank you a lot for your comments and help, Kevin!

written 9 months ago by rin

Hi again!

Looking at it a little more, I have seen that DESeq2 and edgeR use a negative binomial (NB) distribution to normalize gene expression data, meaning that a Z-score would not be valid (or at least would not have the same interpretation) as it would with a normal distribution. Am I misunderstanding something?

To elaborate and make myself as clear as possible, a possible workflow would be:

  1. Check if raw count data downloaded from TCGA follow a normal distribution.
  2. If not, log2 transform.
  3. Remove genes with low read counts.
  4. Calculate the mean and st. dev. of Gene A across samples, then derive a Z-score for Gene A.
  5. Repeat for all genes.
  6. Select genes with Z-score > 2 or < -2.
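
The steps listed above can be sketched in base R as follows. This is only an illustration under stated assumptions: the counts are simulated (not TCGA data), the filtering threshold of 10 reads and the pseudocount of 1 are arbitrary choices, and the library-size normalisation that DESeq2 or edgeR would perform is deliberately left out.

```r
# Simulated raw counts: 6 genes x 8 samples (made-up data, not TCGA)
set.seed(1)
counts <- matrix(rnbinom(48, mu = 200, size = 2), nrow = 6,
                 dimnames = list(paste0("gene", 1:6), paste0("s", 1:8)))
counts["gene6", ] <- 0  # a gene with no reads, to be filtered out

# Step 3: remove genes with low read counts (threshold is an arbitrary assumption)
keep <- rowSums(counts) >= 10
filtered <- counts[keep, , drop = FALSE]

# Step 2: log2 transform, with a pseudocount of 1 to avoid log2(0)
logged <- log2(filtered + 1)

# Steps 4-5: per-gene Z-scores across samples
# (scale() works on columns, hence the double transpose)
z <- t(scale(t(logged)))

# Step 6: flag gene/sample values beyond +/- 2
high <- z > 2
low  <- z < -2

# Number of samples in which each gene is flagged as high or low
n_flagged <- rowSums(high | low)
```

Filtering is done before the log transform here so that the all-zero gene does not produce a zero standard deviation when the Z-scores are computed.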

Are there any steps that I am missing or not understanding correctly? In other words, should the proposed normalization techniques, such as those using the median or quantiles, not be considered?

When it comes to the correlation: CNVs will have to be called by pairwise comparison of normal-tumor samples. Would it still be valid to correlate them to the genes found from the process above?

Thanks once again!

written 9 months ago by rin

The idea was to download TCGA RSEM counts, normalise them in DESeq2 / edgeR, produce logged data from this (via regularised log in DESeq2 or logCPM in edgeR), and then transform to the Z-scale. I would then obtain the CN segment data from the Broad Institute's Firebrowse server and, finally, conduct either a correlation or regression analysis between the RNA-seq genes with |Z| > 2 (or 3) and the CN segments identified. There will obviously be other issues along the way.
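
A minimal sketch of that last correlation / regression step in base R. It assumes the CN segments have already been summarised to one log2 copy-number value per gene per sample, which is a separate step not shown here. All values are simulated, and the sample IDs and gene names are hypothetical.

```r
# Simulated inputs (made-up data): gene x sample matrices
set.seed(42)
samples <- paste0("TCGA-", sprintf("%02d", 1:20))  # hypothetical sample IDs
genes <- c("geneA", "geneB", "geneC")              # hypothetical gene names
cn <- matrix(rnorm(60, sd = 0.5), nrow = 3, dimnames = list(genes, samples))

# Logged expression loosely driven by copy number plus noise
expr <- 8 + 1.5 * cn + matrix(rnorm(60, sd = 0.3), nrow = 3)
dimnames(expr) <- dimnames(cn)
expr <- expr[, sample(samples)]  # shuffle columns to mimic mismatched sample order

# Align the two matrices on shared samples before comparing them
shared <- intersect(colnames(expr), colnames(cn))
expr <- expr[, shared]
cn <- cn[, shared]

# Per-gene Spearman correlation between expression and copy number
rho <- sapply(genes, function(g) cor(expr[g, ], cn[g, ], method = "spearman"))

# Linear regression for a single gene as the alternative to correlation
fit <- lm(expr["geneA", ] ~ cn["geneA", ])
slope <- unname(coef(fit)[2])
```

Aligning on shared sample names matters because the expression and CN data come from different sources and will not necessarily list samples in the same order.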

written 9 months ago by Kevin Blighe

Hi Kevin! Coming back to this (almost ancient now) post for a follow-up question!

I did indeed use DESeq2 with a design of ~tumor + normal. One thing I am quite unsure about is whether I should compute the Z-score from the rlog results as (expression in my samples - mean expression in normal samples) / st. dev. of expression in normal samples.

Am I missing something?

Thank you!

written 6 months ago by rin

Hey rin, to transform to Z-scores, you just need to do:

t(scale(t(data)))
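
To show what that one-liner computes, here is a small sketch on made-up data. It assumes `data` is a logged expression matrix with genes in rows and samples in columns; the sample labels are hypothetical. The last part shows the normal-referenced formula from the question only for comparison, not as a recommendation.

```r
# Made-up log-scale expression: genes in rows, samples in columns
set.seed(7)
data <- matrix(rnorm(40, mean = 8), nrow = 4,
               dimnames = list(paste0("gene", 1:4),
                               c(paste0("normal", 1:3), paste0("tumour", 1:7))))

# scale() centres and scales columns, so transpose to put genes in columns,
# scale, and transpose back: each gene ends up with mean 0 and sd 1
z_all <- t(scale(t(data)))

# The same calculation written out by hand
z_manual <- (data - rowMeans(data)) / apply(data, 1, sd)

# Normal-referenced variant: mean and sd taken from the normal samples only
is_normal <- grepl("^normal", colnames(data))
mu_n <- rowMeans(data[, is_normal])
sd_n <- apply(data[, is_normal], 1, sd)
z_vs_normal <- (data - mu_n) / sd_n
```

The two versions answer different questions: `z_all` measures how far a sample is from the average of all samples, while `z_vs_normal` measures distance from the normal samples, whose mean and st. dev. are estimated less reliably when only a few normals are available.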
written 6 months ago by Kevin Blighe