Tutorial: Machine Learning for Prediction of Relapse in Cancer - Part 2 - Building a Random Forest Classifier
10.3 years ago
Nicholas Spies ★ 1.2k

This tutorial is part of a series illustrating basic concepts and techniques for machine learning in R. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a recent paper.

This follows from: Machine learning for cancer classification - part 1 - preparing the data sets. This part covers how to build a Random Forest classification model to predict relapse in breast cancer from microarray expression data. We assume the data sets are in the correct format as produced in part 1. The data sets in this tutorial are the 'training data' from the prior tutorial, retrieved from GSE2034. In a subsequent tutorial we will apply the classifier built here to the 'test data' (GSE2990), also downloaded in part 1. To avoid the hassle of copy-pasting every block of code, the full script can be downloaded here. But first, let's review the basic principles of the Random Forests method.

Figure 1. A Random Forest is built one tree at a time.

A Random Forest is a collection of decision trees, each of which gets a "vote" in classifying a sample. There are two components of randomness involved in building a Random Forest. First, for each tree, a random subsample (roughly 2/3) of the total data set is selected to grow the tree. Second, at each node of the tree, a well-performing gene from a random subset of all genes is chosen as a "splitter variable". The splitter variable attempts to separate patients in one class (e.g., Response) from those in the other class (e.g., Non-Response). The tree is grown with additional splitter variables until all terminal nodes (leaves) of the tree are purely one class or the other.

Each tree is then "tested" against the roughly 1/3 of patients set aside, the "out-of-bag" (OOB) patients. Each OOB patient traverses the tree, going down one branch or another depending on their gene expression values for each splitter variable, and is assigned a predicted class based on the leaf where they land (a vote). The entire process is repeated with new random 2/3 vs 1/3 divisions of the patients and new random gene subsets for selecting splitter variables, producing additional trees and ultimately a forest. In each case a different subset of patients is used to build the tree and to test its performance. At the end, each patient will have contributed to the construction of ~2/3 of all trees and been tested in the other ~1/3.

Each "test" tree gives a vote for whether the patient will relapse or not relapse. The fraction of votes for relapse is an estimate of the probability of relapse, and each patient is predicted as either relapse or non-relapse (using a probability of 0.5 as the threshold). By comparing these OOB predictions to the known classes, an estimate of the accuracy of the overall forest can be obtained. The forest can then also be applied to independent test data or patients of unknown class (see Figure 2).
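The vote-fraction logic described above can be illustrated with a toy sketch (the vote fractions here are made up for illustration; they are not from the tutorial data):

```r
#Toy illustration of the voting step: each patient's fraction of "Relapse"
#votes across all trees is thresholded at 0.5 to assign a predicted class
votes_relapse=c(0.82, 0.31, 0.57) #hypothetical OOB vote fractions for three patients
predicted=ifelse(votes_relapse > 0.5, "Relapse", "NoRelapse")
predicted #"Relapse" "NoRelapse" "Relapse"
```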


Figure 2. To predict new patients, each tree gets a vote...


Figure 3. Variable importance is a feature of random forests


Now, let's proceed with the exercises. Install and load the necessary packages (if not already installed).
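A minimal sketch, inferred from the functions used later in this tutorial (randomForest for the model, genefilter for the expression filter, ROCR for the ROC/AUC step, and Hmisc for histbackback); install any that are missing before loading:

```r
#Load the packages used below
#randomForest, ROCR and Hmisc are on CRAN; genefilter is a Bioconductor package
library(randomForest)
library(genefilter) #provides filterfun, pOverA, cv, genefilter
library(ROCR)       #provides prediction and performance for the ROC/AUC step
library(Hmisc)      #provides histbackback for the vote-distribution plot
```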



Set the working directory and file names for Input/output
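Something along these lines; the directory and file names below are placeholders, so adjust them to match the outputs you produced in part 1:

```r
#Placeholder paths and file names -- change these to your own setup
setwd("/path/to/your/working/dir")
datafile="trainset_gcrma.txt"            #expression data from part 1 (GSE2034)
clindatafile="trainset_clindetails.txt"  #clinical data from part 1
outfile="trainset_RFoutput.txt"
case_pred_outfile="trainset_CasePredictions.txt"
```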


Next we will read in the data sets (expecting tab-delimited files with a header line). These were the outputs from the previous tutorial mentioned above. We also need to rearrange the clinical data so that it is in the same order as the GCRMA-normalized expression data.

data_import=read.table(datafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_import=read.table(clindatafile, header = TRUE, na.strings = "NA", sep="\t")
data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
rawdata=rawdata[which(!is.na(rawdata[,3])),] #Remove rows with missing gene symbol
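The reordering of the clinical data mentioned above could look like the following sketch. The name of the sample-identifier column ("GSM") is an assumption, so check it against your own clinical file; the idea is simply to sort the clinical rows the same way the expression columns were sorted:

```r
#Reorder the clinical data rows to match the sorted column order of the expression data
#"GSM" is an assumed column name holding the GEO sample identifiers
clin_order=order(clin_data_import[,"GSM"])
clindata=clin_data_import[clin_order,]
```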

Next we filter out any variables (genes) that are not expressed or do not have enough variance to be informative for classification. We will first take the values and un-log2 them, then filter genes according to the following criteria (recommended in the multtest/MTP documentation): (1) at least 20% of samples should have raw intensity greater than 100; and (2) the coefficient of variation (sd/mean) is between 0.7 and 10.

ffun=filterfun(pOverA(p = 0.2, A = 100), cv(a = 0.7, b = 10))
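Applying this filter might look like the sketch below; it assumes, as in the earlier code, that the first three columns of rawdata are annotation and that the expression values are on the log2 scale:

```r
#Un-log2 the expression values, then apply the filter gene-by-gene (row-by-row)
X=rawdata[,4:ncol(rawdata)]              #expression values only, annotation columns dropped
filt=genefilter(2^as.matrix(X), ffun)    #logical vector: TRUE for genes passing both criteria
filt_Data=rawdata[filt,]                 #keep annotation plus expression for the passing genes
```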

We will assign the variables that pass this filtering to a new data structure. Extract just the expression values from the filtered data and transpose the matrix; the latter is necessary because randomForest expects the predictor variables (genes) to be represented as columns instead of rows. Finally, assign the gene symbols as the predictor names.

#Get potential predictor variables
predictor_data=t(filt_Data[,4:ncol(filt_Data)]) #transpose so that genes are columns and patients are rows
predictor_names=c(as.vector(filt_Data[,3])) #gene symbol
colnames(predictor_data)=predictor_names

As a final step before the Random Forest classification, we have to set the variable we are trying to predict as our target variable. In this case, it is relapse status.

target= clindata[,"relapse..1.True."]
target=factor(target, levels=c(0,1), labels=c("NoRelapse","Relapse")) #randomForest requires a factor target for classification

Finally we run the RF algorithm. NOTE: we use an ODD number for ntree. This is because when the forest/ensemble is used on test data, ties are broken randomly. Having an odd number of trees avoids this issue and makes the model fully deterministic. Also note, we will use down-sampling to attempt to compensate for unequal class sizes (fewer relapses than non-relapses).

tmp = as.vector(table(target))
num_classes = length(tmp)
min_size = tmp[order(tmp,decreasing=FALSE)[1]]
sampsizes = rep(min_size,num_classes)
rf_output=randomForest(x=predictor_data, y=target, importance = TRUE, ntree = 10001, proximity=TRUE, sampsize=sampsizes, na.action = na.omit)

The final blocks of code produce various forms of output for analysis of the classifier, and the final classification results. First, save the RF classifier with save(). This allows you to load the saved model later, which is useful if you wish to rerun later parts of this script without the time-consuming model building. More importantly, it allows you to apply the model you have built to new, independent samples for classification purposes.

save(rf_output, file="RF_model")
#load("RF_model") #later, this restores the fitted rf_output object

RandomForest calculates importance measures for each variable. Let's save them to a new object for later use:

rf_importances=importance(rf_output, scale=FALSE)
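For a quick look at the top-ranked genes, you can sort on the Gini-based measure; this assumes the column is named "MeanDecreaseGini", which is what importance() returns when importance = TRUE was set:

```r
#Show the genes with the largest mean decrease in Gini impurity
head(rf_importances[order(rf_importances[,"MeanDecreaseGini"], decreasing=TRUE),])
```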

The following lines give an overview of the classifier's performance. Specifically, they generate a confusion table, from which sensitivity, specificity, accuracy, etc. can be calculated.
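A sketch of how the confusion matrix and the summary statistics used below might be extracted from the fitted model; it assumes the class labels NoRelapse/Relapse, with Relapse as the second (positive) class:

```r
#OOB confusion matrix: rows = true class, columns = predicted class (plus class.error)
confusion=rf_output$confusion
sensitivity=(confusion[2,2]/(confusion[2,2]+confusion[2,1]))*100 #% of actual Relapse predicted as Relapse
specificity=(confusion[1,1]/(confusion[1,1]+confusion[1,2]))*100 #% of actual NoRelapse predicted as NoRelapse
overall_error=rf_output$err.rate[nrow(rf_output$err.rate),"OOB"]*100 #final OOB error estimate
overall_accuracy=100-overall_error
```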

class1_error=paste(rownames(confusion)[1]," error rate= ",confusion[1,3], sep="")
class2_error=paste(rownames(confusion)[2]," error rate= ",confusion[2,3], sep="")

Next we will prepare each useful statistic for writing to an output file:

sens_out=paste("sensitivity=",sensitivity, sep="")
spec_out=paste("specificity=",specificity, sep="")
err_out=paste("overall error rate=",overall_error,sep="")
acc_out=paste("overall accuracy=",overall_accuracy,sep="")
misclass_1=paste(confusion[1,2], rownames(confusion)[1],"misclassified as", colnames(confusion)[2], sep=" ")
misclass_2=paste(confusion[2,1], rownames(confusion)[2],"misclassified as", colnames(confusion)[1], sep=" ")
confusion_out=cbind(rownames(confusion), confusion) #add class names as a column so they survive write.table with row.names=FALSE

Finally, we print all of these to an output file. Note, we will be appending with multiple writes to the same file. This may generate a warning.

write.table(rf_importances[,4],file=outfile, sep="\t", quote=FALSE, col.names=FALSE)
write("confusion table", file=outfile, append=TRUE)
write.table(confusion_out,file=outfile, sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE, append=TRUE)
write(c(sens_out,spec_out,acc_out,err_out,class1_error,class2_error,misclass_1,misclass_2), file=outfile, append=TRUE)

For a simple visualization, we plot the top 30 variables ranked by importance.

varImpPlot(rf_output, type=2, n.var=30, scale=FALSE, main="Variable Importance (Gini) for top 30 predictors")

Figure 4. Simple visualization

An MDS plot provides a sense of the separation of classes.

target_labels=as.vector(c("N","R")[as.factor(target)]) #single-character plot symbols: N=non-relapse, R=relapse
MDSplot(rf_output, target, k=2, xlab="", ylab="", pch=target_labels, palette=c("red", "blue"), main="MDS plot")

Figure 5. Training set MDS

A common method of assessing a classifier's performance is to create an ROC curve and calculate the area under it (AUC). We use the relapse/non-relapse OOB vote fractions as the predictive variable. The ROC curve is generated by stepping through different thresholds for calling relapse vs. non-relapse.

#First calculate the AUC value
pred=prediction(as.vector(rf_output$votes[,"Relapse"]), target) #ROCR prediction object from OOB vote fractions
perf_AUC=performance(pred,"auc")
AUC=perf_AUC@y.values[[1]]
#Then, plot the actual ROC curve
perf_ROC=performance(pred,"tpr","fpr")
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(AUC, digits=5, scientific=FALSE)))

Figure 6. Training set ROC

Produce a back-to-back histogram of vote distributions for Relapse and NoRelapse.

out <- histbackback(split(rf_output$votes[,"Relapse"], target), probability=FALSE, xlim=c(-50,50), main = 'Vote distributions for patients classified by RF', axes=TRUE, ylab="Fraction votes (Relapse)")
barplot(-out$left, col="red" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="blue", horiz=TRUE, space=0, add=TRUE, axes=FALSE)

Figure 7. Training set vote distributions

Finally, we assemble and save the per-patient case predictions.

case_predictions=cbind(clindata, target, rf_output$predicted, rf_output$votes) #clinical data plus OOB predicted class and vote fractions
write.table(case_predictions,file=case_pred_outfile, sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE)

After running this script you will have generated a Random Forest classifier of relapse for breast cancer Affymetrix data. Next, we will apply this classifier to the independent test data set. See Machine learning for cancer classification - part 3 - Predicting with a Random Forest Classifier. You will also have a case_predictions file on which you can perform survival analysis, which will be the subject of a later tutorial. See Machine learning for cancer classification - part 4 - Plotting a Kaplan-Meier Curve for Survival Analysis.


Hello Mr Griffith, it was an absolute pleasure going through your tutorial (which, btw, is a rigorous introduction to ML applications using microarray data)!

I wanted to clarify something. Can you please explain the meaning of the code snippet below and what it is doing?

data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above

I got the order() part, but why "without the first three columns" and then adding 3 to get the correct index?

Same question for rawdata: why take the first 3 columns and then the remaining columns in the order determined above? (I guess this part will become clear automatically once I understand the first part.)

Thanks once again for this wonderful tutorial!

Please answer ASAP. Thanks for that too, in advance!



I'd avoid the "ASAP" part of the request - it's not really a good sign in forums.


@Ram I have my exams in a couple of days, hence the bad "sign", so to speak! Well, now that you have gone through my question here, would you help by addressing the issue?

I still don't really get that part! Would love to hear from Mr Griffith though!



The first three columns in the data set are non-numerical identifiers, so they are excluded from the sort. Because order() is applied only to the subset starting at column 4, the indices it returns are relative to that subset; adding 3 converts them back into column indices of the full data frame. rawdata then keeps the three annotation columns first, followed by the sample columns in sorted order.
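A toy example might make the index arithmetic concrete (the column names here are made up):

```r
#order() returns positions relative to the vector you pass it, not the full data frame
x=c("ID","Probe","Symbol","S3","S1","S2") #three annotation columns, then sample names
order(x[4:6])    #2 3 1 -- positions within the 3-element subset
order(x[4:6])+3  #5 6 4 -- the matching column indices in the full vector
```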

Hope this helps!



I have a doubt; please correct me if I am wrong or missing anything. From the list of the top 30 predictors, I found that the gene "MLF1IP" is not present in the dataset GSE2034. Since the list of predictors is generated from the original dataset only, how can this gene be absent from the dataset?


Dear Nicholas Spies,

I would like to ask some important questions regarding this specific part of the machine learning procedure with random forests in R, as I want to implement it to my current project. In detail, in contrast to the first part of the tutorial, my 2 datasets comprise of two colorectal cancer affymetrix datasets, but of different platform: hgu133a and hgu133plus2. Thus,

  1. How could I deal with this problem, regarding the test and train sets for building a classifier, in order to separate primary colon tumor samples from adjacent control samples (paired: each patient has two samples)? I have described a possible methodology in a Bioconductor Support post [I'm just posting the link if anyone wants further information: https://support.bioconductor.org/p/69669/], in which I used the inSilicoMerging package to merge the two datasets based on their ~22,000 common probe IDs. I understand this has various pitfalls regarding batch effects, but the package performs batch-effect correction when it merges the datasets, and from the pca and plotMDS functions the samples seem to separate well. So, if I proceed this way, how could I construct the train and test datasets? Perhaps with the caret package and the createDataPartition function, or is that irrelevant?
  2. Secondly, should I reduce the number of inputs for the construction of the classifier, i.e., the number of probe sets? For example, should I use only a subgroup of differentially expressed genes resulting from a prior statistical analysis before the classification procedure?

Please excuse any naive questions; I have been using R for the last 7 months, and this tutorial is very useful and crucial to implement in my analysis, in order to compare it with some other methodologies I have implemented for selecting a subset of candidate genes (biomarkers for further validation). Any help or advice on this matter would be greatly appreciated!



Can you clarify how the best splitter (gene) is chosen in a random forest? Is it based on an expression threshold? If so, how is the threshold calculated?


Hi Nicholas,

I asked Obi the same question, but I am not sure who responds quicker. I have RNA-Seq data sets. How would I load them in to run Random Forest? Would I need to use a completely different approach?

Many thanks


Please ask this as a new question and not as an answer to this post.


Hi, first of all, thanks for this tutorial; it's done very well. I have the following suggestion for the Random Forest statement:

rf_output=randomForest(x=predictor_data, y=target, importance = TRUE, ntree = 10001, proximity=TRUE, sampsize=sampsizes, na.action = na.omit)

The added na.action = na.omit filters out the NA values.

With this addition I've solved the problems in Part 3 of this tutorial.


Thanks for the input! It's a great thought and I will add it to the example.

