I am working on a project that seeks to identify potential biomarkers in neuroblastoma by using the tumor tissue expression data for 249 patients.

For my visualization/data analysis, I know that I can use Kaplan-Meier curves to show survival probability over time in relation to gene expression levels, but I am looking for suggestions on what other sorts of graphs/curves I could use to represent my results (I don't want to have only multiple Kaplan-Meier graphs on my poster board when presenting the project). I was thinking about using ROC curves, but I am not sure that I fully understand the reasoning behind them, and I would really appreciate it if someone could explain ROC curves and whether they would be a good fit for my project.

I sometimes obtain z-scores from Cox regression and then just plot those using whatever graph is convenient. Each gene gets its own z-score (because you run a separate Cox regression on each gene you're interested in).
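To make the "one z-score per gene" idea concrete, here is a minimal from-scratch sketch of a univariate Cox fit that returns the Wald z-score (coefficient divided by its standard error). The function name `cox_z_score`, the gene names, and the toy data are all made up for illustration; for real work you would more likely use an established package (for example `coxph` in R's survival package, or lifelines in Python) rather than this loop.

```python
import math

def cox_z_score(times, events, x, n_iter=50, tol=1e-8):
    """Wald z-score for a single covariate in a univariate Cox model.

    times:  follow-up time per patient
    events: 1 if the event was observed, 0 if censored
    x:      expression of one gene per patient
    Tied event times are handled with the Breslow approximation.
    """
    beta = 0.0
    info = 0.0
    for _ in range(n_iter):
        score = 0.0  # first derivative of the partial log-likelihood
        info = 0.0   # negative second derivative (observed information)
        for t_i, d_i, x_i in zip(times, events, x):
            if not d_i:
                continue  # censored patients only contribute via risk sets
            # Risk set: everyone still under observation at time t_i.
            s0 = s1 = s2 = 0.0
            for t_j, x_j in zip(times, x):
                if t_j >= t_i:
                    w = math.exp(beta * x_j)
                    s0 += w
                    s1 += w * x_j
                    s2 += w * x_j * x_j
            mean = s1 / s0
            score += x_i - mean
            info += s2 / s0 - mean * mean
        if info <= 0:
            return float("nan")  # no information (e.g. constant expression)
        # Newton-Raphson step, capped at +/-1 to avoid overshooting.
        step = max(-1.0, min(1.0, score / info))
        beta += step
        if abs(step) < tol:
            break
    return beta * math.sqrt(info)  # z = beta / se, with se = 1 / sqrt(info)

# Toy data (hypothetical): 10 patients, two genes.
times = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
events = [1, 1, 1, 1, 1, 1, 0, 1, 0, 1]
expression = {
    "GENE_A": [1.8, 0.9, 1.6, 1.2, 0.4, 1.5, 0.7, 1.0, 0.2, 0.6],
    "GENE_B": [0.2, 1.1, 0.4, 0.8, 1.6, 0.5, 1.3, 1.0, 1.8, 1.4],
}
z_scores = {gene: cox_z_score(times, events, values)
            for gene, values in expression.items()}
for gene, z in z_scores.items():
    print(gene, round(z, 2))
```

A positive z means higher expression goes with earlier events (worse prognosis) and a negative z means the opposite. No high/low cutoff is involved at any point.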

For example, say there are 100 genes out of 20000 genes that you are interested in. I might make an empirical cumulative distribution function (eCDF) plot of the z-scores of those 100 genes overlaid on the eCDF plot of the z-scores of the reference distribution (all 20000 genes).
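As a rough sketch of what goes into such a plot, the snippet below computes the two eCDFs on a shared grid. The z-score values are invented placeholders, not real results; the two resulting curves could then be handed to any plotting tool as step lines.

```python
import bisect

def ecdf(values):
    """Return a function F where F(q) is the fraction of values <= q."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda q: bisect.bisect_right(ordered, q) / n

# Placeholder z-scores: "all genes" and a small subset of interest.
all_z = [-2.1, -1.3, -0.8, -0.4, -0.1, 0.0, 0.3, 0.6, 1.1, 1.9]
subset_z = [0.6, 1.1, 1.9]

F_all = ecdf(all_z)
F_sub = ecdf(subset_z)

grid = [-2, -1, 0, 1, 2]
curve_all = [F_all(q) for q in grid]  # reference curve
curve_sub = [F_sub(q) for q in grid]  # curve for the genes of interest

for q, a, s in zip(grid, curve_all, curve_sub):
    print(f"z <= {q:>2}: all = {a:.2f}, subset = {s:.2f}")
```

If the subset curve sits to the right of the reference curve, the genes of interest tend to have higher z-scores than a typical gene.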

ROC curves plot the true positive rate against the false positive rate. Say you have a test that predicts whether someone has a disease based on some score. If the test score is greater than x, they test positive for the disease. Otherwise, they test negative. This test isn't going to be perfectly accurate. For a given value of x, you plot the true positive rate of the test against its false positive rate. But wait, what is this threshold x? Well, you try all different values of x and plot the true positive rate and false positive rate for each one (what if x is a big number? what if x is a small number? etc.). Voila, you get an ROC curve!
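The threshold sweep described above can be sketched in a few lines. The scores and labels here are invented; in the neuroblastoma setting the score might be a gene's expression and the label something like "event within a fixed follow-up window", but that framing is an assumption on my part.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs, one per threshold x, calling 'positive' when score > x."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Try every observed score as a threshold, plus both extremes.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for x in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > x and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > x and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve using the trapezoid rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Made-up example: 1 = has the disease, 0 = does not.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
print("AUC =", round(auc(points), 3))
```

A very large x calls nobody positive, giving the point (0, 0); a very small x calls everybody positive, giving (1, 1). The curve in between shows the trade-off, and an area near 0.5 means the score does little better than chance.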

Thanks for the very helpful reply!

Since you mentioned z-scores, I had a quick question. My data is RMA transformed/normalized (like this: http://www.molmine.com/magma/loading/rma.htm). When I attempted to calculate the global z-scores of my RMA normalized data, I was not getting a mean value of 0 (the z-score values were not being calculated properly). Could that be because my data is highly skewed and already quantile normalized?

How would I be able to make these graphs without calculating z-scores? Is there some other way that I can set cutoff values and establish what is high/low expression using just the RMA data (perhaps by using the standard deviation or the median)?

Thanks so much for your help!

I'm not familiar with RMA. What does the distribution of your data look like? Maybe try log2-transforming your final data to reduce skew?

I'd recommend just trying to fix the distribution of the data -- otherwise, I'm going to believe that the statistical assumptions of your survival analysis have been violated.

If you're going to apply an arbitrary high/low cutoff, you can do a Kaplan-Meier analysis but report the results in some other way than showing survival curves. The high/low comparison gives you statistics that you can plot (e.g. log-rank p-values, or a hazard ratio from a Cox model fit on the high/low groups).
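For what it's worth, one common (and still arbitrary) cutoff is the per-gene median. The sketch below splits patients into high/low groups for one gene and also shows that per-gene z-scores, computed as (value - mean) / standard deviation, average to 0 up to rounding error. The expression values and function names are hypothetical.

```python
import statistics

def median_split(expression):
    """Label each patient 'high' or 'low' relative to the gene's median."""
    cutoff = statistics.median(expression)
    return ["high" if v > cutoff else "low" for v in expression]

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample SD."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical log2 expression of one gene across six patients.
gene_expr = [6.1, 7.4, 5.9, 8.0, 6.7, 7.1]

groups = median_split(gene_expr)
z = z_scores(gene_expr)

print(groups)
print("mean of z-scores:", statistics.mean(z))
```

The resulting group labels are what a Kaplan-Meier or log-rank comparison would take as input. If the z-scores of a gene do not average to roughly 0, the mean and standard deviation were probably taken over a different set of values than the ones being standardized.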

I have done a Cox regression and found hazard ratios/p-values for the genes I want to graph, but I am not sure how I can put that on a curve without knowing what constitutes high/low expression. Here is what the values of my data look like in general:

AADACL3   5.26498
AADACL4   5.04068
ACADM     7.54957
ACAP3     7.87952
ACOT11    6.73529
ACOT7     8.018
ACTB      11.1586
ACTL8     5.99607
ACTRT2    5.60795
ADC       6.47824
ADPRHL2   6.84182
…..

According to http://www.molmine.com/magma/loading/rma.htm my RMA data is already log2-transformed, so I am not sure how I can fix the distribution of my data.

Thanks!

High vs. low is based on arbitrary thresholds. Cox allows you to look at how expression relates to prognosis without needing a binary high/low threshold.

Did you find the Cox results for EVERY gene? (Not just the ones you're interested in). Plot that distribution.

Also, plot the distribution of the expression data so you can figure out how it is skewed.

Thanks for the response! Sorry for all the questions, but I did find Cox results for every gene. By plotting this, do you mean plotting the distribution on something like this https://www.itl.nist.gov/div898/handbook/eda/section3/eda336.htm (a Box-Cox plot)?

Sure, that works, or you can simply make a histogram.

It's important to see what your data looks like.

In general, you should always look at how your data is distributed before doing any downstream analysis on it.
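A quick way to do that check without any plotting library is to bin the values and compute the sample skewness. The sketch below uses the handful of expression values quoted earlier in the thread purely as sample input; the function names are my own, and a real check would run over all patients for a gene (or all Cox z-scores).

```python
import statistics

def skewness(values):
    """Sample skewness: mean of cubed standardized deviations."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mu) / sd) ** 3 for v in values) / len(values)

def histogram_counts(values, n_bins=5):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all values match
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # max value goes in last bin
        counts[idx] += 1
    return counts

# Values quoted in the thread (RMA, log2 scale).
values = [5.26498, 5.04068, 7.54957, 7.87952, 6.73529, 8.018,
          11.1586, 5.99607, 5.60795, 6.47824, 6.84182]

counts = histogram_counts(values)
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
print("skewness:", round(skewness(values), 2))
```

A skewness near 0 suggests a roughly symmetric distribution; a clearly positive value means a long right tail, which here comes from the one very highly expressed gene (ACTB).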

Honestly, your main issue was your Cox z-scores not having a mean of 0. That's fine -- real data isn't ever going to give you a mean of exactly 0 -- I mainly want to know that the distribution looks close to normal when you plot it.

I made histograms (thank you for the suggestion), and many of the genes seem to have a close to normal distribution (several are skewed though, and none are perfect, obviously). Will this skewness in certain genes cause any issues in my analysis? Thanks for the help!

I'd imagine that most genes should be close enough to normal (or log-normal), so it should be fine.

I'd just go with that -- after all, you're just trying to show the prognostic value of your biomarkers. Trying to make every single gene fit the assumptions of the Cox model better is not worth the effort (and I honestly can't think of an easy way to go about doing so).