Question

Group/Cluster tumor samples in an RNA-Seq dataset for survival analysis, based on the expression of a small gene signature

0

Entering edit mode

5.5 years ago

svlachavas ▴ 790

Dear Community,

based on a collaboration project, we have identified a small gene signature, discriminating relatively well control samples from tumor ones, based on an initial microarray analysis. The more important thing is that these genes, are also correlated with specific clinical imaging variables. Thus, our next step is to validate in independent TCGA RNA-Seq dataset of the same cancer, the prognostic impact of these genes-as in these TCGA datasets we dont have the clinical variables information-and especially if the correlated signature, could identify patient groups that have different survival estimates.

Thus, my major question is the following: as this signature is relatively small: 22 genes, 10 up-reg in tumors, and the 12 down-reg (based on the microarray DE analysis), how i should proceed or cluster the tumor samples based on this signature in the RNA-Seq independent dataset ? i should not take into account the normal samples included ? and focus only on the tumors ?

For example, for each gene, create a median expression score across all tumors, then separate the tumor samples into high and low ? And then, create for the total signature an average score ? Or this is biased, as i do not take into account the expression direction from the microarray analysis ? and i should proceed with something more robust ?

My main notion, is not to perform survival analysis for each separately-rather, if possible, identify based on the whole signature, groups of tumor samples, that are separated based on groups of these genes.

Thank you in advance,

Efstathios

RNA-Seq gene signature survival clustering R • 1.7k views

ADD COMMENT • link updated 5.5 years ago by Kevin Blighe 87k • written 5.5 years ago by svlachavas ▴ 790

score 3 · Answer 1 · 2018-10-04

3

Entering edit mode

5.5 years ago

Kevin Blighe 87k

Hey bro,

There are different ways of approaching this, and each can be valid. If you ask 2 people how to conduct survival analysis using gene expression data, you will receive 2 different answers.

If you have a 22-gene signature already, then you can generate many different things from that. You could do the following:

create a combined model as:

.

glm(TumourNormalStatus ~ gene1 + gene2 + ... + gene22, data=data, family=binomial(link="logit"))

reduce the model further via stepwise regression (OPTIONAL)
determine 'robustness' (loosely termed) of the model via r-squared shrinkage and cross validation (C: Resources for gene signature creation)
perform ROC analysis

NB - the exponential of the coefficient from glm is an odds ratio

---------------------------------------------------------

For 'survival' analysis, you could do Cox proportional hazards (Cox PH) regression. However, you do not mention that you have any survival phenotype / clinical data? In this way, you could do the following:

transform your gene expression into Z-scores
encode each gene's expression into high, medium, low based on Z-score thresholds (e.g. high, >2; low, <-2)
test each gene independently in a Cox PH model as follows:

.

coxph(Surv(TumourNormalStatus) ~ Gene1)
coxph(Surv(TumourNormalStatus) ~ Gene2)

NB - the exponential of the coefficient from coxph is a hazards ratio

Coincidentally, svlachavas, I have been developing a tutorial for my new Bioconductor package that includes a comprehensive part on Cox PH survival models: Survival analysis via Cox Proportional Hazards regression

Kevin

ADD COMMENT • link 5.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Dear Kevin,

thanks a lot for the detailed feedback and your suggestions-based on your ideas, in order to respond and ask some crusial questions on very specific points:

A) Firstly, for your initial suggestion about the combined model with the glm function: if i have understood well, this approach would be mainly to further reduce the initial signature of the 22genes-or more suitably a bigger signature-to less discriminator genes, but based on the initial phenotype under study ? that is cancer vs tumor samples ?

glm(TumourNormalStatus ~ gene1 + gene2 + ... + gene22, data=data, family=binomial(link="logit"))

in which above, TumourNormalStatus is the binary factor variable in the RNA-Seq data, indicating tumor or normal samples ?

And your basic notion for this, is that before moving to survival step, eliminate further the size of the initial microarray signature, but now testing in the RNA-Seq level ? with a more "sophisticated" way than the direct DE analysis ?

B) Regarding the actual survival analysis-about your question for clinical data-yes, through TCGAbiolinks, i have full clinical data for survival, as for other variables, such as Tumor Stage, etc-however, except survival, the other variables have a lot of NA values-which in your opinion, would not compromize a Cox proportional hazard analysis, as it depends on added coefficients ? except the gene expression ?

C) Again for survival, but for the direct approach-in your opinion, you would follow "a gene-by-gene" survival approach ? as mentioned by z-score above to separate the tumor samples, but also take into account the normal, for more robust results ? or for example using quantile thresholds ? and then highlight or pinpoint any of the 22 genes that show a significant result in survival ?

Or you believe, a "gene-set" approach, would be "more biologically appealing " ? and separate groups of tumor only samples, without taking into account the normal ones ? something like with ConsensusClusterPlus ? and see, if any interesting groups of gene expression patterns appear in specific clusters ?

Thank you for your time,

Efstathios

ADD REPLY • link 5.5 years ago by svlachavas ▴ 790

0

Entering edit mode

A) Firstly, for your initial suggestion about the combined model with the glm function: if i have understood well, this approach would be mainly to further reduce the initial signature of the 22genes-or more suitably a bigger signature-to less discriminator genes, but based on the initial phenotype under study ? that is cancer vs tumor samples ?

glm(TumourNormalStatus ~ gene1 + gene2 + ... + gene22, data=data, family=binomial(link="logit"))

in which above, TumourNormalStatus is the binary factor variable in the RNA-Seq data, indicating tumor or normal samples ?

And your basic notion for this, is that before moving to survival step, eliminate further the size of the initial microarray signature, but now testing in the RNA-Seq level ? with a more "sophisticated" way than the direct DE analysis ?

Yes, the idea can be about reducing the size of the model, but it is optional. If you are happy with the 22-gene signature, then you should continue with that. If, however, it performs poorly in cross-validation and ROC analysis, then you may consider going back a few steps and reducing the signature in size.

------------------------------------------------

B) Regarding the actual survival analysis-about your question for clinical data-yes, through TCGAbiolinks, i have full clinical data for survival, as for other variables, such as Tumor Stage, etc-however, except survival, the other variables have a lot of NA values-which in your opinion, would not compromize a Cox proportional hazard analysis, as it depends on added coefficients ? except the gene expression ?

Yes, the TCGA clinical data is very sparse, i.e., contains many NAs. This makes survival analysis difficult. What is the sample size for those patients who have complete survival data?

Cox proportional hazards can be used for survival analysis, but you can also use it for any binary trait, such as Tumour-Normal status. If you are not going to use any actual survival data, then I would just use the glm() and continue with that.

---------------------------------------------------

C) Again for survival, but for the direct approach-in your opinion, you would follow "a gene-by-gene" survival approach ? as mentioned by z-score above to separate the tumor samples, but also take into account the normal, for more robust results ? or for example using quantile thresholds ? and then highlight or pinpoint any of the 22 genes that show a significant result in survival ?

You have no distinction in your tumour samples, for example, Grade I,Grade II, Grade III, et cetera. Also, your signature was built by comparing Tumour Vs Normal in microarray. So, you need the normal samples as comparison. If you have tumour grade or histology information, then you can use that as the y / end-point in your model and avoid the use of the normals. I cannot see the data that you have, so, everything that I write is speculatory.

You can test all genes in the same formula, you can test each gene independently and then create a 'final formula' of the significant genes, or you can reduce the 22-gene model via stepwise regression. There are diverse ways of working with the data.

You can definitely divide the gene expression by quantile, too.

----------------------------------------------------

Or you believe, a "gene-set" approach, would be "more biologically appealing " ? and separate groups of tumor only samples, without taking into account the normal ones ? something like with ConsensusClusterPlus ? and see, if any interesting groups of gene expression patterns appear in specific clusters ?

You should do this in addition to everything else. You should then test to see if your identified clusters have any relation to clinical data. As I showed in the vitamin D study, the clusters were statistically significantly related to the dosage of vitamin D received by the mother during pregnancy.

Hope that this helps!

ADD REPLY • link 5.5 years ago by Kevin Blighe 87k