Question on hazard ratio for differential expression and survival in published TCGA analysis
4.0 years ago
GregHrambus ▴ 10

Hi all--long-time lurker, first time poster. I'm a grad student trying to replicate a published analysis of TCGA data (found here) and I had a dumb/newbie question on hazard ratios and differential expression, which then translates into a larger did-I-screw-this-up question regarding my own analysis.

In Figure 2, the authors give hazard ratios for a univariate analysis of gene expression related to survival with HR>1 indicating higher expression associated with better survival. They state that the Cox model was run on gene expression and time to recurrence, which I would have thought means that HR>1 actually suggests worse survival with increased expression, contrary to the annotation on the forest plot. Am I off base there?

I'm asking in part because I'm very close to matching their Kaplan-Meier curves for survival (Figure 3), but in my analysis the curves for high/low expression are flipped relative to the authors' such that high expression leads to worse survival. This would make sense to me if, in fact, HR>1 should mean a higher chance for recurrence with increased expression. I should note that while the survival curves are flipped in my own analysis, I've been able to somewhat closely replicate the HRs (at least in terms of >1 or <1) by using FPKM ~ days to recurrence for each candidate gene as the authors seem to have done.

To this amateur's eyes it seems like either I or the authors have something backwards. I've looked over my inputs to the graphs/models and I can't find any obvious errors in evaluating time to recurrence or high/low expression.

I'm not alleging malfeasance on the part of the authors (or malpractice by the peer reviewers)--my intuition is that I've goofed somewhere but I've spent enough time looking at this without success that I'm turning to the internet for help. Does it seem like I missed something with respect to the Cox model/hazard ratios? If so I can go back and triple check my code for the KM curves. I'm hoping to use TCGA data in some future work and if I've got it all turned around and backwards at this point I'd like to know! My goal in replicating this was mostly just to get a feel for working with the TCGA data.

TCGA RNA-Seq • 1.9k views
4.0 years ago

Hey Greg,

From what I can see, higher expression of 9 of these genes is associated with better outcome - the exception is PTPRN2. This is evident in Figure 3. The Hazard Ratios (HRs) are then quoted from the perspective of low expression compared to high. So, an HR>1 means that low expression of 9/10 of the genes relates to increased risk of recurrence.

This is stated in the abstract too:

Kaplan–Meier analysis and Cox proportional hazard models showed that low expression of all the candidate genes except for PTPRN2 were associated with tumor progression and recurrence in a PCa cohort.
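
To illustrate why the reference group matters, here is a minimal sketch of a one-covariate Cox fit in plain Python. The cohort is a made-up toy example (not TCGA data), and `fit_cox_1d` is a hypothetical helper written for this post, not the authors' method or any library's API. It fits the same data twice, once with high expression coded as 1 and once with low expression coded as 1, and the two HRs come out as reciprocals of each other.

```python
import math

def fit_cox_1d(times, events, x, max_iter=50, tol=1e-10):
    """Return the Cox log hazard ratio for one covariate (Newton-Raphson, Breslow ties)."""
    beta = 0.0
    for _ in range(max_iter):
        grad = 0.0   # first derivative of the log partial likelihood
        info = 0.0   # negative second derivative (observed information)
        for t_i, d_i, x_i in zip(times, events, x):
            if not d_i:
                continue  # censored subjects only contribute through risk sets
            s0 = s1 = s2 = 0.0
            for t_j, x_j in zip(times, x):
                if t_j >= t_i:  # subject j is still at risk at time t_i
                    w = math.exp(beta * x_j)
                    s0 += w
                    s1 += w * x_j
                    s2 += w * x_j * x_j
            mean = s1 / s0
            grad += x_i - mean
            info += s2 / s0 - mean * mean
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    return beta

# Hypothetical toy cohort: days to recurrence, event flag (1 = recurrence,
# 0 = censored), and a binary expression group (1 = high, 0 = low).
days = [5, 8, 12, 14, 20, 22, 30, 35, 40, 50]
recurred = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]
high = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
low = [1 - h for h in high]  # same grouping, opposite reference level

hr_high_vs_low = math.exp(fit_cox_1d(days, recurred, high))
hr_low_vs_high = math.exp(fit_cox_1d(days, recurred, low))
print(f"HR (high vs low): {hr_high_vs_low:.2f}")
print(f"HR (low vs high): {hr_low_vs_high:.2f}")
```

In this toy cohort the high-expression group tends to recur later, so the high-vs-low HR is below 1 and the low-vs-high HR is above 1. Both describe the same association, which is why it is worth checking which group a forest plot treats as the reference.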

The radiation group's genes don't reach statistical significance, which is evident from the fact that their confidence intervals cross the key frontier of HR==1 (Figure 2).
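
As a rough sketch of that check, assuming the usual Wald interval on the log scale: given a Cox coefficient and its standard error, the 95% CI for the HR is exp(beta ± 1.96·SE), and the result is not significant at the 5% level when that interval contains 1. The coefficients below are invented for illustration and are not taken from the paper.

```python
import math

def hazard_ratio_ci(beta, se, z=1.96):
    """Return (HR, lower, upper) for a Wald confidence interval on the log scale."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def ci_excludes_one(lower, upper):
    """True when the interval lies entirely above or entirely below HR = 1."""
    return lower > 1.0 or upper < 1.0

# Hypothetical (beta, SE) pairs, not values from the paper.
examples = {"gene_A": (0.8, 0.30), "gene_B": (0.4, 0.35)}
for name, (beta, se) in examples.items():
    hr, lo, hi = hazard_ratio_ci(beta, se)
    verdict = "excludes 1" if ci_excludes_one(lo, hi) else "crosses 1"
    print(f"{name}: HR={hr:.2f}, 95% CI [{lo:.2f}, {hi:.2f}] ({verdict})")
```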

Finally, log2(FPKM) is not the best expression unit to use here, and leaves room for bias in the results.

I only scanned through the paper but could not see any major discrepancy in the way that the authors present the work. The only part likely to cause some confusion is the x-axis label in Figure 2.

Interesting little paper.

Kevin


Thanks Kevin! I appreciate you taking a look at this. I guess it wasn't clear to me that the HRs were comparing low to high expression. I thought this was just for the log rank test and KM curves. The Fig 2 axis label also didn't help.

To clarify, if they modeled expression (measured by FPKM) as a continuous variable instead of simply high/low, would HR>1 indicate a worse prognosis? One of the things that threw me off initially is that when I was working with the same data from UCSC Xena I was able to get similar HRs (e.g. PTPRN2 HR of 0.7) by plugging in FPKM as continuous. I'll be double checking my code though now.

Re: their use of log2(FPKM), I had read one of your excellent posts on this and had hoped to eventually compare the authors' results using FPKM to results using normalized data via e.g. DESeq2. As you can see of course I haven't got there yet.

Thanks again!


I did not check how they encoded the gene expression, but I would normally encode it into tertiles or quartiles, and use these categorically. Looking at their survival curves again, it seems certain that they have encoded the gene expression ranges into binary high|low - this must be in their methods?
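
A minimal sketch of that kind of encoding, using only the Python standard library. The expression values are invented, and `encode_expression` is a hypothetical helper name. The cut points come from `statistics.quantiles`, so `n_groups=3` gives tertiles, `n_groups=4` quartiles, and `n_groups=2` a median split into low|high.

```python
import statistics

def encode_expression(values, n_groups=3):
    """Assign each value a group index from 0 (lowest) to n_groups - 1 (highest)."""
    cuts = statistics.quantiles(values, n=n_groups)  # n_groups - 1 cut points
    # A value's group is the number of cut points it exceeds.
    return [sum(v > c for c in cuts) for v in values]

# Hypothetical expression values for one gene across nine samples.
expression = [1.2, 3.4, 0.5, 8.9, 5.6, 2.1, 7.3, 4.4, 6.0]

tertiles = encode_expression(expression, n_groups=3)
labels = ["low", "mid", "high"]
print([labels[g] for g in tertiles])

median_split = encode_expression(expression, n_groups=2)
print(["high" if g else "low" for g in median_split])
```

Note that a value exactly equal to a cut point falls into the lower group here. That is an arbitrary choice in this sketch, so it is worth checking how ties at the cut-off were handled when trying to reproduce someone else's high|low split.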

Generally, though, if you leave it as continuous, then, yes, it will quote the HR, initially, from the perspective of increased gene expression, so HR>1 means higher risk with higher expression. However, in the case of the Cox PH model, the output will usually also include the inverse HR (exp(-coef) in the summary of an R coxph fit, for example), which is the HR from the perspective of decreasing expression.
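
To spell out the continuous case, here is the standard Cox model written out for a single expression covariate x with coefficient β. The symbols are generic notation, not taken from the paper.

```latex
% Cox proportional hazards model with one continuous covariate x
h(t \mid x) = h_0(t)\, e^{\beta x}

% Hazard ratio for a one-unit increase in x
\frac{h(t \mid x + 1)}{h(t \mid x)} = e^{\beta}

% Hazard ratio for a one-unit decrease in x (the reciprocal)
\frac{h(t \mid x - 1)}{h(t \mid x)} = e^{-\beta} = \frac{1}{e^{\beta}}
```

So with expression left continuous, HR = e^β > 1 means the hazard of recurrence rises as expression increases, and the same fit read in the other direction gives 1/HR. If x is on a log2 scale, one unit corresponds to a doubling of expression.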

In this tutorial, I actually do an initial 'sweep' over the transcriptome encoded continuously, and then encode the statistically significant findings into low|mid|high, where I then test them again: Survival analysis with gene expression


Thanks! I'll check out the tutorial.
