Tutorial: Machine Learning For Prediction of Relapse in Cancer - Part 2 - Building A Random Forest Classifier
Written 5.7 years ago by Nicholas Spies (The Genome Institute at WUSTL, St. Louis, United States)

This tutorial is part of a series illustrating basic concepts and techniques for machine learning in R. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a recent paper.

This follows from: Machine learning for cancer classification - part 1 - preparing the data sets. This part covers how to build a Random Forest classification model to predict relapse in breast cancer from microarray expression data. We assume the data sets are in the format produced in part 1. The data sets in this tutorial are the 'training data' from the prior tutorial, retrieved from GSE2034. In a subsequent tutorial we will apply the classifier built here to the 'test data' (GSE2990), also downloaded in part 1. To avoid the hassle of copy-pasting every block of code, the full script can be downloaded here. But first, let's review the basic principles of the Random Forests method.

Figure 1. A Random Forest is built one tree at a time.

A Random Forest is a collection of decision trees, and each tree gets a "vote" in classifying. There are two components of randomness involved in building a Random Forest. First, at the creation of each tree, a random subsample (roughly 2/3) of the patients is selected to grow the tree; the remaining ~1/3 are set aside as that tree's "out-of-bag" (OOB) patients. Second, at each node of the tree, a well-performing gene from a random subset of all genes is chosen as a "splitter variable". The splitter variable attempts to separate patients in one class (e.g., Relapse) from those in the other class (e.g., NoRelapse). The tree is grown with additional splitter variables until all terminal nodes (leaves) of the tree are purely one class or the other. The tree is then "tested" against its OOB patients. Each OOB patient traverses the tree, going down one branch or another depending on his/her gene expression values for each splitter variable, and is assigned a predicted class based on where he/she lands in the tree (a vote). The entire process is repeated with new random 2/3 and 1/3 divisions of patients and new random gene subsets for splitter selection, producing additional trees and ultimately a forest; each tree is built and tested on a different subset of patients. At the end, each patient will have contributed to the construction of roughly 2/3 of all trees and been tested in the other ~1/3. Each "test" tree casts a vote for whether the patient will relapse or not. The fraction of votes for relapse estimates the probability of relapse, and each patient is predicted as either relapse or non-relapse (using a probability of 0.5 as the threshold). Comparing these OOB predictions to the known classes yields an estimate of the accuracy of the overall forest. The forest can then also be applied to independent test data or patients of unknown class (see Figure 2).
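The vote-aggregation step described above can be sketched with a base-R toy example. This mimics only the OOB voting idea, not the actual randomForest internals; all numbers are simulated:

```r
# Toy illustration of out-of-bag (OOB) voting (base R only; simulated data).
set.seed(1)
n_patients <- 10
n_trees <- 25
# Per-tree votes: NA means the patient was in-bag for that tree
# (used to build it), so that tree does not vote for him/her.
votes <- matrix(sample(c("Relapse", "NoRelapse", NA), n_patients * n_trees,
                       replace = TRUE, prob = c(0.35, 0.35, 0.3)),
                nrow = n_patients)
# Fraction of OOB votes for "Relapse", per patient
relapse_frac <- apply(votes, 1, function(v) mean(v[!is.na(v)] == "Relapse"))
# Majority vote at the 0.5 threshold gives the OOB class prediction
oob_pred <- ifelse(relapse_frac > 0.5, "Relapse", "NoRelapse")
```

In the real model below, `rf_output$votes` plays the role of `relapse_frac` (already aggregated per class) and `rf_output$predicted` the role of `oob_pred`.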

Figure 2. To predict new patients, each tree gets a vote...

Figure 3. Variable importance is a feature of random forests

Now, let's proceed with the exercises. Install and load the necessary packages (if not already installed).

install.packages("randomForest")
install.packages("ROCR")
install.packages("Hmisc")
source("http://bioconductor.org/biocLite.R")
biocLite("genefilter")

library(randomForest)
library(ROCR)
library(genefilter)
library(Hmisc)

Set the working directory and file names for input/output:

setwd("/Users/ogriffit/git/biostar-tutorials/MachineLearning")
datafile="trainset_gcrma.txt" 
clindatafile="trainset_clindetails.txt"
outfile="trainset_RFoutput.txt"
varimp_pdffile="trainset_varImps.pdf"
MDS_pdffile="trainset_MDS.pdf"
ROC_pdffile="trainset_ROC.pdf"
case_pred_outfile="trainset_CasePredictions.txt"
vote_dist_pdffile="trainset_vote_dist.pdf"

Next we will read in the data sets (expecting a tab-delimited file with header line and rownames). These were the outputs from the previous tutorial mentioned above. We also need to rearrange the clinical data so that it will be in the same order as the GCRMA expression data.

data_import=read.table(datafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_import=read.table(clindatafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_order=order(clin_data_import[,"GEO.asscession.number"])
clindata=clin_data_import[clin_data_order,]
data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
header=colnames(rawdata)
rawdata=rawdata[which(!is.na(rawdata[,3])),] #Remove rows with missing gene symbol

Next we filter out any variables (genes) that are not expressed or do not have enough variance to be informative for classification. We first un-log2 the values, then filter genes according to the following criteria (recommended in the multtest/MTP documentation): (1) at least 20% of samples should have raw intensity greater than 100; (2) the coefficient of variation (sd/mean) is between 0.7 and 10.

X=rawdata[,4:length(header)]
ffun=filterfun(pOverA(p = 0.2, A = 100), cv(a = 0.7, b = 10))
filt=genefilter(2^X,ffun)
filt_Data=rawdata[filt,]

We will assign the variables that pass this filter to a new data structure. Extract just the expression values from the filtered data and transpose the matrix; the latter is necessary because randomForest expects the predictor variables (genes) to be represented as columns instead of rows. Finally, assign the gene symbols as the predictor names.

#Get potential predictor variables
predictor_data=t(filt_Data[,4:length(header)])
predictor_names=c(as.vector(filt_Data[,3])) #gene symbol
colnames(predictor_data)=predictor_names

As a final step before the Random Forest classification, we have to set the variable we are trying to predict as our target variable. In this case, it is relapse status.

target= clindata[,"relapse..1.True."]
target[target==0]="NoRelapse"
target[target==1]="Relapse"
target=as.factor(target)

Finally we run the RF algorithm. NOTE: we use an ODD number for ntree. When the forest/ensemble is applied to test data, ties between classes are broken randomly; an odd number of trees avoids ties altogether and makes the model fully deterministic. Also note that we use down-sampling (via sampsize) to compensate for unequal class sizes (fewer relapses than non-relapses).

tmp = as.vector(table(target))
num_classes = length(tmp)
min_size = tmp[order(tmp,decreasing=FALSE)[1]]
sampsizes = rep(min_size,num_classes)
rf_output=randomForest(x=predictor_data, y=target, importance = TRUE, ntree = 10001, proximity=TRUE, sampsize=sampsizes, na.action = na.omit)

The final blocks of code produce various forms of output for analysis of the classifier, plus the final classification results. First, save the RF classifier with save(). This allows you to load the saved model later, which is useful if you wish to rerun later parts of this script without the time-consuming model building. More importantly, it allows you to apply the model you have built to new, independent samples for classification purposes.

save(rf_output, file="RF_model")
load("RF_model")

randomForest calculates importance measures for each variable. Let's save them to a new object for later use:

rf_importances=importance(rf_output, scale=FALSE)

The following lines will give an overview of the classifier's performance. Specifically, they will generate a confusion table to allow calculation of sensitivity, specificity, accuracy, etc.

confusion=rf_output$confusion
sensitivity=(confusion[2,2]/(confusion[2,2]+confusion[2,1]))*100
specificity=(confusion[1,1]/(confusion[1,1]+confusion[1,2]))*100
overall_error=rf_output$err.rate[length(rf_output$err.rate[,1]),1]*100
class1_error=paste(rownames(confusion)[1]," error rate= ",confusion[1,3], sep="")
class2_error=paste(rownames(confusion)[2]," error rate= ",confusion[2,3], sep="")
overall_accuracy=100-overall_error

Next we will prepare each useful statistic for writing to an output file.

sens_out=paste("sensitivity=",sensitivity, sep="")
spec_out=paste("specificity=",specificity, sep="")
err_out=paste("overall error rate=",overall_error,sep="")
acc_out=paste("overall accuracy=",overall_accuracy,sep="")
misclass_1=paste(confusion[1,2], rownames(confusion)[1],"misclassified as", colnames(confusion)[2], sep=" ")
misclass_2=paste(confusion[2,1], rownames(confusion)[2],"misclassified as", colnames(confusion)[1], sep=" ")
confusion_out=confusion[1:2,1:2]
confusion_out=cbind(rownames(confusion_out), confusion_out)

Finally, we print all of these to an output file. Note, we will be appending with multiple writes to the same file. This may generate a warning.

write.table(rf_importances[,4],file=outfile, sep="\t", quote=FALSE, col.names=FALSE)
write("confusion table", file=outfile, append=TRUE)
write.table(confusion_out,file=outfile, sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE, append=TRUE)
write(c(sens_out,spec_out,acc_out,err_out,class1_error,class2_error,misclass_1,misclass_2), file=outfile, append=TRUE)

For a simple visualization, we plot the top 30 variables ranked by importance.

pdf(file=varimp_pdffile)
varImpPlot(rf_output, type=2, n.var=30, scale=FALSE, main="Variable Importance (Gini) for top 30 predictors")
dev.off()

An MDS plot provides a sense of the separation of classes.

pdf(file=MDS_pdffile)
target_labels=as.vector(target)
target_labels[target_labels=="NoRelapse"]="N"
target_labels[target_labels=="Relapse"]="R"
MDSplot(rf_output, target, k=2, xlab="", ylab="", pch=target_labels, palette=c("red", "blue"), main="MDS plot")
dev.off()

A common method of assessing a classifier's performance is to create an ROC curve and calculate the area under it (AUC). We use the relapse vote fractions as the predictive variable. The ROC curve is generated by stepping through different thresholds for calling relapse vs non-relapse.

predictions=as.vector(rf_output$votes[,2])
pred=prediction(predictions,target)
#First calculate the AUC value
perf_AUC=performance(pred,"auc")
AUC=perf_AUC@y.values[[1]]
#Then, plot the actual ROC curve
perf_ROC=performance(pred,"tpr","fpr")
pdf(file=ROC_pdffile)
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(AUC, digits=5, scientific=FALSE)))
dev.off()

Produce a back-to-back histogram of vote distributions for Relapse and NoRelapse.

options(digits=2)
pdf(file=vote_dist_pdffile)
out <- histbackback(split(rf_output$votes[,"Relapse"], target), probability=FALSE, xlim=c(-50,50), main = 'Vote distributions for patients classified by RF', axes=TRUE, ylab="Fraction votes (Relapse)")
barplot(-out$left, col="red" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="blue", horiz=TRUE, space=0, add=TRUE, axes=FALSE)
dev.off()

Finally, we save our case predictions.

case_predictions=cbind(clindata,target,rf_output$predicted,rf_output$votes)
write.table(case_predictions,file=case_pred_outfile, sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE)

After running this script you will have generated a Random Forest classifier of relapse for breast cancer Affymetrix data. Next, we will apply this classifier to the independent test data set; see Machine learning for cancer classification - part 3 - Predicting with a Random Forest Classifier. You also have a case-predictions file on which you can perform survival analysis, which will be the subject of a later tutorial; see Machine learning for cancer classification - part 4 - Plotting a Kaplan-Meier Curve for Survival Analysis.

— modified 11 months ago by RamRS • written 5.7 years ago by Nicholas Spies

Hello Mr Griffith, it was an absolute pleasure going through your tutorial (which, by the way, is a rigorous introduction to ML applications using microarray data)!

I wanted to clarify something. Can you please explain the meaning of the code snippet below and what it is doing?

data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above

I got the order() part, but why "without the first three columns" and then adding 3 to get the correct index?

Same question for rawdata: why take the first 3 columns and then the remaining columns in the order determined above? (I guess this part will become clear automatically once I understand the first part.)

Thanks once again for this wonderful tutorial!

Please answer ASAP. Thanks for that too, in advance!

Shayantan

— modified 11 months ago by RamRS • written 4.1 years ago by banerjeeshayantan

I'd avoid the "ASAP" part of the request - it's not really a good sign in forums.

— written 4.1 years ago by RamRS

@Ram I have my exams in a couple of days, hence the bad "sign", so to speak! Well, now that you have gone through my question, would you help by addressing the issue?

I still don't really get that part! I would love to hear from Mr Griffith, though!

— written 4.1 years ago by banerjeeshayantan

Hello, 

The first three columns in the data set are non-numerical identifiers. We set them aside to simplify the rest of the data-frame manipulations. order() is computed on the sample columns only, so its indices are relative to that subset; adding three converts them back to column positions in the full data frame, which keeps the gene order as intended.

Hope this helps!

Nick

— modified 4.0 years ago • written 4.0 years ago by Nicholas Spies
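To make the indexing concrete, here is a minimal base-R sketch with made-up column names (the `Probe`/`ID`/`Symbol`/`GSM_*` names are hypothetical, standing in for the tutorial's annotation and sample columns):

```r
# Hypothetical header mirroring data_import's layout: 3 annotation
# columns followed by sample columns in arbitrary order
cols <- c("Probe", "ID", "Symbol", "GSM_c", "GSM_a", "GSM_b")
# order() sees only columns 4..6, so it returns positions 1..3 relative
# to that subset; adding 3 maps them back to positions in the full header
data_order <- order(cols[4:length(cols)]) + 3
cols[c(1:3, data_order)]
# → "Probe" "ID" "Symbol" "GSM_a" "GSM_b" "GSM_c"
```

Without the `+ 3`, the indices 1..3 returned by order() would point at the annotation columns instead of the intended samples.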

I have a doubt; please correct me if I am wrong or missing anything. I found that, from the list of top 30 predictors, the gene "MLF1IP" is not present in the dataset GSE2034. Since the list of predictors is generated from the original dataset only, how can this gene be absent from the dataset?

— modified 4.0 years ago • written 4.0 years ago by swapnil

Dear Nicholas Spies,

I would like to ask some important questions regarding this specific part of the machine learning procedure with random forests in R, as I want to apply it to my current project. In contrast to the first part of the tutorial, my two datasets are colorectal cancer Affymetrix datasets from different platforms: hgu133a and hgu133plus2. Thus:

1. How could I deal with this, regarding the train and test sets used to build a classifier, in order to separate primary colon tumor samples from adjacent control samples (paired: each patient has two samples)? I have thought of a methodology, which I posted on the Bioconductor Support group (https://support.bioconductor.org/p/69669/), in which I used the inSilicoMerging package to merge the two datasets based on their ~22,000 common probe IDs. I understand this has various pitfalls regarding batch effects, but the package performs batch-effect correction when it merges the datasets, and in the pca and plotMDS plots the samples seem to separate well. So, if I proceed, how could I construct the train and test datasets? Perhaps with the caret package and the function createDataPartition, or is that irrelevant?

2. Secondly, should I reduce the number of inputs (probesets) for the construction of the classifier, i.e., use only a subgroup of DEG genes resulting from a prior statistical analysis?

Please excuse any naive questions; I have been using R for the last 7 months, and this tutorial is very useful and crucial for my analysis, in which I compare several methodologies for selecting a subset of candidate genes (biomarkers for further validation). Any help or advice on this matter would be appreciated!

Efstathios-Iason

— modified 4.0 years ago • written 4.0 years ago by svlachavas

Can you clarify how the best splitter (gene) is chosen in a random forest? Is it based on an expression threshold? If so, how is the threshold calculated?

— written 3.7 years ago by CHANG
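For reference on this point: the CART-style trees that randomForest grows evaluate, for each candidate gene in the random subset, thresholds lying between consecutive sorted expression values, and keep the gene/threshold pair giving the largest decrease in Gini impurity. A minimal base-R sketch of that scoring (illustrative only; `best_split` is a made-up helper, not the package internals):

```r
# Gini impurity of a set of class labels: 1 - sum of squared class fractions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# For one gene, score every candidate threshold (midpoints between
# consecutive sorted expression values) by the decrease in Gini impurity,
# and return the best threshold.
best_split <- function(expr, labels) {
  u <- sort(unique(expr))
  cuts <- head(u, -1) + diff(u) / 2
  decrease <- sapply(cuts, function(t) {
    left  <- labels[expr <= t]
    right <- labels[expr >  t]
    gini(labels) -
      (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
  })
  cuts[which.max(decrease)]
}

# A cleanly separable toy gene: the threshold lands between 3 and 10
best_split(c(1, 2, 3, 10, 11, 12), c("N", "N", "N", "R", "R", "R"))
# → 6.5
```

So the threshold is not fixed in advance; it is recalculated at every node from the expression values of the in-bag patients reaching that node.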

Hi Nicholas,

I asked Obi the same question, but I'm not sure who responds quicker. I have RNA-Seq data sets. How would I load them in to run Random Forest? Would I need to use a completely different approach?

Many thanks

— written 3.3 years ago by helen.smith

Please ask this as a new question rather than as an answer to this post.

— written 3.3 years ago by Istvan Albert

Hi, first of all thanks for this tutorial; it's done very well. I have the following suggestion for the Random Forest statement, to filter the NA values:

rf_output=randomForest(x=predictor_data, y=target, importance = TRUE, ntree = 10001, proximity=TRUE, sampsize=sampsizes, na.action = na.omit)

With this addition I solved the problems in Part 3 of this tutorial.

— modified 11 months ago by RamRS • written 12 months ago by m.colonna

Thanks for the input! It's a great thought and I will add it to the example.

— written 11 months ago by Nicholas Spies