This tutorial is part of a series illustrating basic concepts and techniques for machine learning in R. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a recent paper.
This follows from: Machine learning for cancer classification - part 1 - preparing the data sets, Machine learning for cancer classification - part 2 - Building a Random Forest Classifier and Machine learning for cancer classification - part 3 - Predicting with a Random Forest Classifier. For this tutorial, I will be referring to the test data set used in part 3 (GSE2990). Here we will take case predictions from the Random Forest Classifier, divide into low, intermediate and high risk groups and then perform survival analysis to determine whether these groups predict long-term outcome. The full script can be downloaded here.
Load the necessary packages.
Set a directory and output files.
datadir="/Users/nspies/biostar-tutorials/MachineLearning/" setwd(datadir) case_pred_outfile="testset_CasePredictions.txt" KMplotfile="KaplanMeier_TestSet_RFRS.pdf"
Assign the results of the Random Forest classifier to a local variable.
clindata_plusRF=read.table(case_pred_outfile, header = TRUE, na.strings = "NA", sep="\t")
Next, we will divide our dataset into quantiles based on the relapse risk predicted by random forests. In this example we use three groups, low, medium and high, separated evenly into thirds.
quantiles=quantile(clindata_plusRF[,"Relapse"], probs=c(0.33333,0.66667)) clindata_plusRF[,"RF_Group2"]=clindata_plusRF[,"Relapse"] clindata_plusRF[which(clindata_plusRF[,"Relapse"]<=quantiles),"RF_Group2"]="low" clindata_plusRF[which(clindata_plusRF[,"Relapse"]>quantiles & clindata_plusRF[,"Relapse"]<=quantiles),"RF_Group2"]="int" clindata_plusRF[which(clindata_plusRF[,"Relapse"]>quantiles),"RF_Group2"]="high"
We choose to rename the time column in our data to make it easier to read.
Next, we add the event column. In our case we screened out any data points which lie outside of 10 years after the starting point, as seen by the second line of code here.
At this point, clindata_plusRF contains quite a bit of information we won't use at the moment, so it helps to create a streamlined dataframe with only the pertinent information.
Take that data and create a survival object using the Surv() function. We also calculate a p-value here and format it to three significant digits.
surv_data.surv = with(surv_data, Surv(t_rfs, e_rfs_10yrcens==1)) #Calculate p-value survdifftest=survdiff(surv_data.surv ~ RF_Group2, data = surv_data) survpvalue = 1 - pchisq(survdifftest$chisq, length(survdifftest$n) - 1) survpvalue = format(as.numeric(survpvalue), digits=3)
The next code block creates a linear test p-value, using each of our risk groups as ordinal variables (L, M, H = 1, 2, 3). Then summarizes this test.
surv_data_lin=clindata_plusRF[,c("t_rfs","e_rfs_10yrcens","RF_Group2")] surv_data_lin[,"RF_Group2"]=as.vector(surv_data_lin[,"RF_Group2"]) surv_data_lin[which(surv_data_lin[,"RF_Group2"]=="low"),"RF_Group2"]=1 surv_data_lin[which(surv_data_lin[,"RF_Group2"]=="int"),"RF_Group2"]=2 surv_data_lin[which(surv_data_lin[,"RF_Group2"]=="high"),"RF_Group2"]=3 surv_data_lin[,"RF_Group2"]=as.numeric(surv_data_lin[,"RF_Group2"]) survpvalue_linear=summary(coxph(Surv(t_rfs, e_rfs_10yrcens)~RF_Group2, data=surv_data_lin))$sctest survpvalue_linear = format(as.numeric(survpvalue_linear), digits=3)
Finally, we plot our Kaplan-Meier curve using our survival object. The following code will give you a KM plot.
krfit.by_RFgroup = survfit(surv_data.surv ~ RF_Group2, data = surv_data) pdf(file=KMplotfile) colors = rainbow(5) title="Survival by RFRS - Test Set" plot(krfit.by_RFgroup, col = colors, xlab = "Time (Years)", ylab = "Relapse Free Survival", main=title, cex.axis=1.3, cex.lab=1.4) abline(v = 10, col = "black", lty = 3)
The final block of code will plot a legend to help with interpretation of the figure
groups=sort(unique(surv_data[,"RF_Group2"])) #returns unique factor levels sorted alphabetically names(colors)=groups groups_custom=c("low","int","high") colors_custom=colors[groups_custom] group_sizes_custom=table(surv_data[,"RF_Group2"])[groups_custom] groups_custom=c("Low","Intermediate","High") #Reset names legend_text=c(paste(groups_custom, " ", "(", group_sizes_custom, ")", sep=""),paste("p =", survpvalue_linear, sep=" ")) legend(x = "bottomleft", legend = legend_text, col = c(colors_custom,"white"), lty = "solid", bty="n", cex=1.2) dev.off()
As you can see in the KM plot, patients predicted by the random forests classifier to have low probability of relapse have much better relapse-free survival outcomes than patients predicted to have high probability of relapse. This is of course expected given that we trained the random forest classifier on relapse status. However, it illustrates a practical application of the machine learning exercise completed in tutorials 1 to 3. Using the Random Forest probabilities we could define a cut-off for a low-risk group that is sufficiently unlikely to have a relapse over the next 10 years to alter their treatment from aggressive chemo to watch-and-wait.
I hope this mini-tutorial has been helpful, and if you have any questions or concerns please feel free to leave a comment or message me. Thanks!