Tutorial: Machine Learning For Cancer Classification - Part 1 - Preparing The Data Sets
gravatar for Obi Griffith
3.5 years ago by
Obi Griffith15k
Washington University, St Louis, USA
Obi Griffith15k wrote:

This tutorial is part of a series illustrating basic concepts and techniques for machine learning in R. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a recent paper.

The first step is to prepare the data sets. We will use GSE2034 as a training data set and GSE2990 as a test data set. These are both data sets making use of the Affymetrix U133A platform (GPL96). First, let's download and pre-process the training data. Note: The gcrma step may require you to have as much as ~8gb of ram. I ran this tutorial on a Mac Pro (Snow Leopard) with R 3.0.2 installed. It should also work on linux or windows but package installation might differ slightly.

Set some data directories.


Install the necessary bioconductor packages (if not already installed).


We will use custom CDF packages rather than the standard CDFs. Note, if the packages below do not install successfully with the biocLite method current source files can be installed manually. Go to the latest version, choose the link to ENTREZG for the appropriate version of R, find the HGU133A row and download three source packages ('C', 'P' and 'A'). In a terminal, run R CMD INSTALL pckgfilename where pckgfilename is the name of the package file you downloaded.


Then, load all necessary libraries

library(hgu133ahsentrezgcdf) #cdfname="HGU133A_HS_ENTREZG"

Set a data directory, download a GEO dataset, unpack and gunzip, and create a list of files for processing.

setwd(paste(tempdir,"GSE2034", sep=""))
untar("GSE2034_RAW.tar", exdir="data")
cels = list.files("data/", pattern = "CEL")
sapply(paste("data", cels, sep="/"), gunzip)
cels = list.files("data/", pattern = "CEL")

Create an AffyBatch object and perform GCRMA normalization.

setwd(paste(tempdir,"GSE2034/data", sep=""))
raw.data=ReadAffy(verbose=TRUE, filenames=cels, cdfname="HGU133A_HS_ENTREZG") 

Extract expression values. Then, remove Affy control probes. In this custom CDF (Version 17.1.0, ENTREZG), these are found in rows 12031 to 12098. You can check for yourself and look for probe names starting with "AFFX" using 'rownames(gcrma)'. Alternatively you could use a regex to grab just those rows that don't contain "AFFX"

gcrma=gcrma[which(!grepl("AFFX", rownames(gcrma))),]

Map probe set identifiers to Entrez gene symbols and IDs and then combine with raw data.

symbol = unlist(mget(probes, hgu133ahsentrezgSYMBOL))
ID = unlist(mget(probes, hgu133ahsentrezgENTREZID))

Write GCRMA-normalized and mapped data to file.

write.table(gcrma, file = "trainset_gcrma.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)

Get the clinical details for this dataset.

write.table(GSE2034_clindata, "trainset_clindetails.txt", quote = FALSE, sep = "\t", row.names = FALSE)

After running all of the above commands you should have a data file (trainset_gcrma.txt) with expression estimates for 12,030 genes and 286 breast cancer samples and a second file with clinical details (trainset_clindetails.txt). To avoid copy-pasting you can download the processAffyTrainData.R script.

IMPORTANT: Obtaining the test data set follows a very similar procedure and can be achieved by downloading and running the processAffyTestData.R script. If successful, the latter will give you a second expression data file (testset_gcrma.txt) with 189 additional breast cancer samples and another clinical details file (testset_clindetails.txt). You will need both the trainset___ and testset___ data files to successfully complete subsequent tutorials.

The next tutorial will show you how to train a randomForest model for classification of relapse versus non-relapse. See Machine learning for cancer classification - part 2 - Building a Random Forest Classifier. NOTE: this tutorial overlaps substantially with this tutorial but was modified here to produce the exact files we will need for subsequent parts of the series.

ADD COMMENTlink modified 4 months ago by yuanjun.ma0 • written 3.5 years ago by Obi Griffith15k

Are these two gene expression data sets truly independent or do they share some samples?

ADD REPLYlink written 3.5 years ago by Christian2.5k

I believe they are independent (non-overlapping) but you are smart to ask this question. See Figure 1 of my paper for a depiction of this problem.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Obi Griffith15k

Very nice figure and great you looked into that!

ADD REPLYlink written 3.5 years ago by Christian2.5k

Thanks for the tutorial. For the CDF file it should be HGU133a instead of HGU95A. Am I right?

ADD REPLYlink written 3.4 years ago by Diwan350

Yes. Thanks for spotting that typo. Corrected.

ADD REPLYlink written 3.4 years ago by Obi Griffith15k

Just a note: If biocLite("hgu133ahsentrezgcdf") fails (I got an error about R 3.0 version) After manually downloading the 3 files the command: R CMD INSTALL has to be used from command line not within R

To install within R, I downloaded the win32 version of the 3 files and then used: install.packages("~/path/hgu133ahsentrezgprobe_17.1.0.zip", repos = NULL)

I didn't know I have use repos=NULL and spent some time to figure it out. Hope it helps other like.

ADD REPLYlink written 3.4 years ago by Diwan350

Yes. Thank you for these tips. There are all kinds of issues with installing these packages, specific to linux, windows, mac and depending on version of R and bioconductor. Probably worth a tutorial in itself.

ADD REPLYlink written 3.4 years ago by Obi Griffith15k

The Dataset (GSE2034) mentioned above contains 22,283 probe Id's, which consists of multiple probes for a single gene or multiple genes for a single probe. After running all the above commands, you got expression estimates for 12,030 genes, that means it contains single expression value (probe) for each 12,030 genes. My questions is how the single expression value for each gene is obtained from the multiples sets of probe/genes. Through the commands I could not understand the process of how the numbers got reduced, and is there any other way to eliminate such multiples and get the single expression value for each gene ?

ADD REPLYlink written 21 months ago by swapnil900

I am using the custom CDF (from here, see instructions above). The group who produced them has remapped all probes to new probe sets based on Entrez Genes.

ADD REPLYlink written 21 months ago by Obi Griffith15k

HI ,

Tank you for your nice tutorial, but I want to ask you is it possible to apply this tutorial for an eSet,that I have reached from different platform?


ADD REPLYlink written 20 months ago by Shamim Sarhadi170

Thank you so much for the tutorial.

I get this error when I run the step to extract clinical details:


909 GSM36894 negative 109 0 ER+ 0 913 GSM36911 negative 80 1 ER+ 0 </internal-data> </data-table> </series>

</miniml> :3: parser error : xmlSAX2StartDocument <miniml ^="" error="" in="" usemethod("xpathapply")="" :="" no="" applicable="" method="" for="" 'xpathapply'="" applied="" to="" an="" object="" of="" class="" "null"="" in="" addition:="" warning="" message:="" in="" xmlroot.xmlinternaldocument(xmlparsedoc(txt))="" :="" empty="" xml="" document<="" p="">

Could you please help.

Thanks Sharvari

ADD REPLYlink written 4 months ago by sgujja20

Hi Obi,

I try to retrieve the clinical data using the command


I get the following error:

Error in UseMethod("xmlRoot") :
no applicable method for 'xmlRoot' applied to an object of class "NULL"

What I have already tried is to use the new function getGSEDataTables() from github. By doing that, I can retrieve the data, but I still get the following error:

883     GSM37032        negative        86      0       ER+     0
887     GSM37033        negative        161     0       ER+     0
890     GSM37043        negative        161     0       ER-     0
891     GSM37044        negative        112     0       ER+     0
894     GSM37045        negative        123     0       ER-     0
899     GSM37046        negative        86      0       ER+     0
900     GSM36887        negative        108     0       ER+     0
901     GSM36889        negative        108     0       ER+     0
903     GSM36892        negative        110     0       ER+     0
909     GSM36894        negative        109     0       ER+     0
913     GSM36911        negative        80      1       ER+     0

:3: parser error : xmlSAX2StartDocument
Error in UseMethod("xpathApply") :
  no applicable method for 'xpathApply' applied to an object of class "NULL"
In addition: Warning message:
In xmlRoot.XMLInternalDocument(xmlParseDoc(txt)) : empty XML document
ADD REPLYlink written 14 days ago by debitboro50
gravatar for helen.smith-2
12 months ago by
helen.smith-20 wrote:

Hi Obi,

I have RNA-Seq data sets. How would I load them in to run Random forest? Would I need to use a completely different approach?

Many thanks

ADD COMMENTlink written 12 months ago by helen.smith-20

The general principles are the same but this tutorial (part 1) is really focused on preparing a datafile from microarray data. You will want to use an alternative approach to create a file that is analogous to datafile="trainset_gcrma.txt". This would be a simple matrix of gene (or transcript or other feature-level) expression estimates (e.g., TPM or FPKM values) for each sample. You could then start with Machine Learning For Cancer Classification - Part 2 - Building A Random Forest Classifier with relatively minor modifications (e.g., gene filter strategy would be slightly different) to try to train a RF model using that training data with some target classes. I suggest you go through at least parts 1 & 2 as documented here and really try to understand the inputs and outputs of each command. Then think about how your data fits this workflow.

ADD REPLYlink modified 7 days ago • written 7 days ago by Obi Griffith15k
gravatar for yuanjun.ma
4 months ago by
yuanjun.ma0 wrote:

Hi, I am a beginner of Machine Learning. When I started to use your R.script I was stuck at the last two commands.I can not get clinical tables for GSE2034 (trainset_clindetails.txt)! I have tried everything I could (for the whole week!). No problem with testset scripts (GSE2990).


Could you please help to upload the "trainset_clindetails.txt", or send the text file to me by email yuanjun.ma@ki.se)? That would help me a lot to continuing step 2: Machine Learning For Cancer Classification - Part 2 - Building A Random Forest Classifier.

(I have gotten all 3 files of Part I except train set_clindetails.txt.)

Merry Chirstmas!

ADD COMMENTlink written 4 months ago by yuanjun.ma0

All of the files can be found here: https://github.com/obigriffith/biostar-tutorials/tree/master/MachineLearning. The file you are specifically looking for could be downloaded here: https://raw.githubusercontent.com/obigriffith/biostar-tutorials/master/MachineLearning/trainset_clindetails.txt

ADD REPLYlink written 7 days ago by Obi Griffith15k

Please note that the trainset_gcrma.txt and testset_gcrma.txt can change over time if using different versions of the custom CDF annotations. The clindetails.txt files should be static. I don't know why that command doesn't work. It seems like there are regularly issues with GEO services failing or files being moved. I have had periodic issues with both datasets.

ADD REPLYlink written 5 days ago by Obi Griffith15k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1682 users visited in the last hour