Tutorial: Machine Learning For Cancer Classification - Part 3 - Predicting With A Random Forest Classifier
25
gravatar for Obi Griffith
4.3 years ago by
Obi Griffith16k
Washington University, St Louis, USA
Obi Griffith16k wrote:

This tutorial is part of a series illustrating basic concepts and techniques for machine learning in R. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a recent paper.

This follows from: Machine learning for cancer classification - part 1 - preparing the data sets and Machine learning for cancer classification - part 2 - Building a Random Forest Classifier. This part covers how to use a pre-computed Random Forest classification model to predict relapse in new samples of breast cancer. We make the assumption that the data sets will be in the correct format as produced in part 1. The data sets in this tutorial are the 'test data' from the prior tutorial, retrieved from GSE2990. To avoid the hassle of Copy-Pasting every block of code, the full script can be downloaded here.

Load the necessary packages.

library(randomForest)
library(ROCR)
require(Hmisc)

Set working directory and filenames for Input/output

setwd("/Users/ogriffit/git/biostar-tutorials/MachineLearning")
RF_model_file="RF_model"
datafile="testset_gcrma.txt"
clindatafile="testset_clindetails.txt"
outfile="testset_RFoutput.txt"
case_pred_outfile="testset_CasePredictions.txt"
ROC_pdffile="testset_ROC.pdf"
vote_dist_pdffile="testset_vote_dist.pdf"

Next we will read in the data sets (expecting a tab-delimited file with header line and rownames). We also need to rearrange the clinical data so that it will be in the same order as the GCRMA expression data.

data_import=read.table(datafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_import=read.table(clindatafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_order=order(clin_data_import[,"geo_accn"])
clindata=clin_data_import[clin_data_order,]
data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
header=colnames(rawdata)

Extract just the expression values from the filtered data and transpose the matrix.

predictor_data=t(rawdata[,4:length(header)])
predictor_names=as.vector((rawdata[,3])) #gene symbol
colnames(predictor_data)=predictor_names

Load the RandomForests classifier from file (object "rf_output" which was saved previously)

load(file=RF_model_file)

Extract predictor (gene) names from RF model built in the previous tutorial and subset the test data to just these predictors. This is not strictly necessary as the randomForest predict function would automatically restrict itself to these variables. It has been added for clarity.

RF_predictor_names=rownames(rf_output$importance)
predictor_data=predictor_data[,RF_predictor_names]

Run the test data through forest!

RF_predictions_responses=predict(rf_output, predictor_data, type="response")
RF_predictions_votes=predict(rf_output, predictor_data, type="vote")

Join predictions with clinical data and exclude rows with missing clinical data needed for subsequent steps. Then, write results to file.

clindata_plusRF=cbind(clindata,RF_predictions_responses,RF_predictions_votes)
clindata_plusRF=clindata_plusRF[which(!is.na(clindata_plusRF[,"event.rfs"])),]
write.table(clindata_plusRF,file=case_pred_outfile, sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE)

As before, the following lines will give an overview of the classifier's performance. This time instead of estimated performance from out-of-bag (OOB) testing we are measuring actual performance on an independent test set.

confusion=table(clindata_plusRF[,c("event.rfs","RF_predictions_responses")])
rownames(confusion)=c("NoRelapse","Relapse")
sensitivity=(confusion[2,2]/(confusion[2,2]+confusion[2,1]))*100
specificity=(confusion[1,1]/(confusion[1,1]+confusion[1,2]))*100
overall_error=((confusion[1,2]+confusion[2,1])/sum(confusion))*100
overall_accuracy=((confusion[1,1]+confusion[2,2])/sum(confusion))*100
class1_error=confusion[1,2]/(confusion[1,1]+confusion[1,2])
class2_error=confusion[2,1]/(confusion[2,2]+confusion[2,1])

Next we will prepare each useful statistic for writing to an output file

sens_out=paste("sensitivity=",sensitivity, sep="")
spec_out=paste("specificity=",specificity, sep="")
err_out=paste("overall error rate=",overall_error,sep="")
acc_out=paste("overall accuracy=",overall_accuracy,sep="")
misclass_1=paste(confusion[1,2], colnames(confusion)[1],"misclassified as", colnames(confusion)[2], sep=" ")
misclass_2=paste(confusion[2,1], colnames(confusion)[2],"misclassified as", colnames(confusion)[1], sep=" ")
confusion_out=confusion[1:2,1:2]
confusion_out=cbind(rownames(confusion_out), confusion_out)

Create variables for the known target class and predicted class probabilities.

target=clindata_plusRF[,"event.rfs"]
target[target==1]="Relapse"
target[target==0]="NoRelapse"
relapse_scores=clindata_plusRF[,"Relapse"]

Once again, we will create an ROC curve and calculate the area under it (AUC). Recall that we use the relapse/non-relapse vote fractions as predictive variable. The ROC curve is generated by stepping through different thresholds for calling relapse vs non-relapse.

First calculate the AUC value.

pred=prediction(relapse_scores,target)
perf_AUC=performance(pred,"auc")
AUC=perf_AUC@y.values[[1]]
AUC_out=paste("AUC=",AUC,sep="")

Print the results to file along with other stats prepared above.

write("confusion table", file=outfile)
write.table(confusion_out,file=outfile, sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE, append=TRUE)
write(c(sens_out,spec_out,acc_out,err_out,misclass_1,misclass_2,AUC_out), file=outfile, append=TRUE)

Then, plot the actual ROC curve.

perf_ROC=performance(pred,"tpr","fpr")
pdf(file=ROC_pdffile)
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(AUC, digits=5, scientific=FALSE)))
dev.off()

You should now have case prediction file with which we will perform some survival analysis. See: Machine learning for cancer classification - part 4 - Plotting a Kaplan-Meier Curve for Survival Analysis.

classification tutorial • 8.4k views
ADD COMMENTlink modified 10 months ago • written 4.3 years ago by Obi Griffith16k

I would be interested to look at a pythonic version of the tutorial utilizing scikits-learn or other library.

ADD REPLYlink written 4.2 years ago by Pappu1.8k

Me too. Are you volunteering to create one?

ADD REPLYlink written 4.2 years ago by Obi Griffith16k

Tutorials are written by experts in the field like you. I tried machine learning a few times on datasets which were too biased or short and the features chosen did not desribe them properly. So I did not publish any paper on it.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Pappu1.8k

Dear Obi

I am using your tutorial on my test data and Part 2 worked fabulously on my training data set and I thank you wholeheartedly for that, but now I have been stuck for several days. My code gets stuck at the prediction stage where I will be generating my RF predictions. This is the code I have received on my dataset

 

> header=colnames(test)
> predictor_data=(test[,1:length(header)])
> predictor_names=as.vector((test[,1]))
> colnames(predictor_data)=predictor_names
> load(file=RF_model_file)
> RF_predictions_responses=predict(rf_output, predictor_data, type="response")
Error in eval(expr, envir, enclos) : object 'SubjectTimepoint' not found

I am not sure what is going on. I wrote the prediction_names and predictor_data to a csv file but I am still very much confused. I am not sure why it is not finding SubjectTimePoint, it is there in my predictor_data. I have SubjectTimePoint (factor), Relapse(factor), Window(a factor which tells what month the timepoint is in) and then several genera. I want to predict relapse or no relapse from my training data set on this new test dataset. 

I tried looking up the error in StackOverflow but I am afraid I still do not understand what is going on.

I am new to R and I am just trying to learn, so please accept my apologies if this may come off awkward or off base. But if you can help me, I would so very grateful
Respectfully yours

Vijay Antharam 

(mst3000jay@gmail.com)

 

ADD REPLYlink written 2.3 years ago by mst3000jay0

Buenas noches, soy nuevo programando en R, alguien podría por favor decirme cómo solucionar este error? Good evening, I am new to programming in R, someone could tell me how to solve this error?

Good evening, I'm new programming in R GSE2034_clindata=getGSEDataTables("GSE2034")[[2]][1:286,]// run this error : Unknown IO error // and answer this I/O warning : failed to load external entity "http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?targ=self&form=xml&view=full&acc=GSE2034" Error in UseMethod("xmlRoot") : no applicable method for 'xmlRoot' applied to an object of class "NULL"

ADD REPLYlink written 23 months ago by edwinarevaloangulo0
0
gravatar for banerjeeshayantan
2.6 years ago by
banerjeeshayantan30 wrote:

While trying to run the random forest classifier I am getting this error,Please help!

> rf_output=randomForest(x=predictor_data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
Error in randomForest.default(x = predictor_data, y = target, importance = TRUE,  : 
  length of response must be the same as predictors

FYI 

> dim(predictor_data)
[1] 283  52

> length(target)
[1] 286

The lengths are different but I wonder why?I followed each and every steps of this tutorial.

Please help!..Its kinda urgent!

ADD COMMENTlink written 2.6 years ago by banerjeeshayantan30

This is hard to guess. I just tried the commands in the first three tutorials and they worked fine. It looks like you are trying to modify the script for your own data. That is great. Are you expecting 283 cases or 286? This is likely an issue with how you have set up your data matrix and target object and is not specific to this tutorial. Carefully examine your predictor_data matrix. Does it have the expected number of rows (genes?) and columns (patients?). Ask the same question of your target vector.

ADD REPLYlink written 2.6 years ago by Obi Griffith16k
0
gravatar for David_emir
23 months ago by
David_emir230
India
David_emir230 wrote:

Hi,

While i Run the test data through forest

RF_predictions_responses=predict(rf_output, predictor_data, type="response") I am getting the following error:

"Variables in the training data missing in New data"

I did check, it looks fine to me, Kindly Help. I am using the same data set which is used in this tutorials.

Thanks a lot Dave.

ADD COMMENTlink written 23 months ago by David_emir230

Hi Dave, check the col names. If the col names in the original and the new data do not agree this error is thrown.

try ,

 identical(names(original_data),names(new_data))

if this returns "FALSE " the col names do not match. Make sure the col names in original data and the new data are same. Try and let me know.

ADD REPLYlink modified 23 months ago • written 23 months ago by kavya.krishnamurthy28410

Hi, I had the same problem as you. I used processAffyTestData.R script from the first part and it seems that the problem was this part of the script

gcrma=gcrma[1:12030,]

when I replaced it with

gcrma=gcrma[which(!grepl("AFFX", rownames(gcrma))),]

the problem disappeared

ADD REPLYlink modified 22 months ago • written 22 months ago by minio.cz10

This happens because over time the custom CDF files change so that hardcoded number to eliminate AFFX (control) probes becomes incorrect. It is then possible that some of these get included in the model. We added the alternative way to remove AFFX probes in some places but missed others and this likely created the disconnect where an AFFX probe got included in the trained model but then wasn't available in the test data. I've tried to make sure this is consistent everywhere now.

ADD REPLYlink written 10 months ago by Obi Griffith16k

Hi David,

I get the same error. I think, the number of predictor variables used to build the model equal to 1246 (after filtering). But, the number of predictor variables in the test data (ie predictor_data) equal to 12030. those numbers are different. I suggest to perform the filtering in such a way you keep only the predictor variables of the model in predictor_data.

ADD REPLYlink written 10 months ago by debitboro70
0
gravatar for debitboro
10 months ago by
debitboro70
Belgium
debitboro70 wrote:

Hi guys,

Is this script written by experts ??? I don't understand: You have constructed your model based on the filtered predictor variables of length equal to 1260, after that you want to test it on new cases described by 12030 predictor variables !!!

ADD COMMENTlink written 10 months ago by debitboro70
1

@debitboro - welcome to Biostars. The predict function will only use predictor variables that exist in the (previously trained) model that is provided. Running most recently, I get 1247 predictors after filtering the training data down from 12264 total predictors. The RF model is trained on, and only knows about those 1247 predictors. The test dataset also starts with 12264 total predictors (Note: these numbers can change depending on the version of customCDF you use, and those CDFs change as gene annotations evolve). When we supply the pre-trained RF model and the new test data to the predict function, it only considers the 1247 predictors in the RF model. If one of the necessary predictors was missing in the test data then you would get an error. But it doesn't care about extra predictors, they are simply ignored. in this case, since we created the train and test datasets using the exact same method we can expect all the necessary predictors will be there. It is not necessary but we could make this explicit by adding something like:

RF_predictor_names=rownames(rf_output$importance)
predictor_data=predictor_data[,RF_predictor_names]

This would reduce the predictor_data variable to just those genes in the RF model. But as explained above, it is not required since this happens automatically.

As an aside, it is poor etiquette to call into question the expertise of forum contributors. Questions and potential problems are of course welcome. But, the content provided here is not guaranteed to meet your expectations and the people who contribute it do so on a volunteer basis to help others, despite having many other obligations on their time and expertise.

ADD REPLYlink written 10 months ago by Obi Griffith16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 925 users visited in the last hour