Discretisation of gene expression data to perform survival analysis
1
0
Entering edit mode
2.8 years ago
Lindsay865 ▴ 60

Hello all,

I am probably being incredibly naive but I've only studied bioinformatics for a year so forgive me for this question.

My work so far: I've performed differential gene expression analysis between Standard Risk and Low Risk AML patients using DESeq2 in R and successfully identified differentially expressed genes. Whilst running DESeq2 I've obtained variance stabilising transformed (vst) expression values for all genes. 30 of these differentially expressed genes matched with those on a list of genes of interest so I've subsetted vst values for those genes. I am expected to perform survival analysis based on the gene expression of these genes of interest which are differentially expressed and had found a question answered by Michael Love suggesting to perform survival analysis on vst data (sorry, unable to locate the post) but it didn't specify how this could be completed. It was suggested to me by my supervisor to use "quartile expression of genes" so I discretised these vst values by assigning them to either Q1, Q2, Q3 or Q4 for each gene. I found that for some genes the Q1, Q2, and Q3 thresholds were the same number leading to problems in discretising expression values.

My question: I now think that I shouldn't be using vst data for this purpose or I shouldn't be using quartiles, perhaps both. I had a thought of computing Z-scores from vst data (probably wrong about that one) or Z-scores from the raw counts and then discretising from there into perhaps Low/Medium/High expression categories and performing survival analysis - similar to this tutorial by Kevin Blighe.

Has anybody got any further advice on this?

Thanks so much in advance for any help given,

Lindsay

RNA-Seq R survival analysis gene expression • 1.1k views
2
Entering edit mode
2.8 years ago

Hey Lindsay, there really is no right or wrong, in this regard. In my tutorial, I actually run an independent model for each gene as measured on the continuous scale (and the data is from microarray; so, quantile normalised, log2-transformed, and measured on the normal/Gaussian distribution). I then take a closer look at the genes that have a statistically significant p-value from this initial screen and only then encode them as Low | Mid | High via Z-scores. Once encoded, they are again checked as categorical variables in the coxph() survival model and then plotted. The results in the tutorial doing it this way were in line with the literature, as you can see.

Your idea of using quartiles is perfectly valid. Be sure that you encode a reference level when you do this. For example, having the reference as the lowest quartile makes for a different clinical interpretation than having it as one of the middle or higher levels. I neither see any major issue in using VST counts. If Michael Love recommends that, and anything that he states regarding RNA-seq supersedes anything that I say, of course, then that is what I would also recommend. Without going into a long description on it, I can understand why VST would be preferable to rlog in this case.

By the way, for me, tertiles makes more intuitive sense, and is more suitable if sample n is low.

Kevin

0
Entering edit mode

If I stick to using quartiles with VST counts I still have the issue of not knowing which quartile some of the VST values fall into, eg. for one gene across numerous (51 out of 62) patients the VST value is the exact same - leading to the Q1, Q2 and Q3 thresholds being the exact same. I don't really know what to do about this. Even if I obtained Low/Mid/High tertiles (using Z-scores of VST counts using scale() perhaps?), I think this would mean that there would be no patient samples falling into some of these categories also.

By the way, my sample n is 31 for each group (Standard and Low Risk) so fairly low.

Any further comments would be more than welcome - very stuck with this analysis, unfortunately. Apologies for my naivety.

Thanks, Lindsay

**EDIT: I tried rlog transformation instead of VST and I no longer have an issue with sorting values into quartiles as no values seem to be repeated across lots of patients. Is this acceptable to use?

0
Entering edit mode

Hey, sorry, I was away overnight. Just wondering why many (?) of your genes end up having the same value when transformed via the variance-stabilising transformation? Are these non-coding genes or of low values, by any chance? Did you perform any pre-filtering for low-count genes prior to normalisation?

Using regularised log counts is fine, too - just merely means stating something different in your materials and methods! The 'genuine' genes of interest that are associated with your outcome / endpoint should be identified with either rlog or vst counts. The 'fringe' genes that may not have much to do with your outcome of interest may appear in one or the other, or neither.

Kevin

0
Entering edit mode

Hi Kevin,

No need to apologise - very grateful for any time you've spent reading my question/comments!

I've focused on 30 coding genes and most VST values look ok. Maybe ~5-10 of these genes have same VST value across many patient samples - after examining the raw counts, some of these genes have low counts but others don't. Unsure what could be causing it. I've also attempted to filter low-count genes prior to DESeq():

#Setting up countData object

#Removing genes with sum total of 10 reads across all samples
ddsTARGET<-ddsTARGET[rowSums(counts(ddsTARGET))>=10,]

#Running DESeq2
ddsTARGET<-DESeq(ddsTARGET)

#VST
vstTARGET<-varianceStabilizingTransformation(ddsTARGET, blind=TRUE)


I think I'll perhaps stick to rlog transformation for now if I can't get around that issue (despite it taking forever to run).

Thanks so much, Lindsay

0
Entering edit mode

You are using all genes for the purpose of normalisation and the subsequent vst, right? Are you aware of the blind=TRUE/FALSE functionality, too? With blind=TRUE, the transformation is performed independent of your design formula. With design=FALSE, the transformation will actually adjust your transformed counts based on the factors supplied in your design formula. For downstream processes, blind=FALSE is recommended. This may give you what you need. Otherwise, try rlog!

1
Entering edit mode

Hi Kevin,

I promise I won't spam you any more with this question.

I was unaware that blind = FALSE was recommended for downstream analyses, however I did try it and although the numbers changed slightly, I did have the same issue with repeated VST values across patient samples for the same gene.

I think I'll stick with rlog!

Thanks once again and all the best.

Lindsay