This tutorial is part of a series illustrating basic concepts and techniques for machine learning in R. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a recent paper.
The first step is to prepare the data sets. We will use GSE2034 as a training data set and GSE2990 as a test data set. These are both data sets making use of the Affymetrix U133A platform (GPL96). First, let's download and pre-process the training data. Note: The gcrma step may require you to have as much as ~8gb of ram. I ran this tutorial on a Mac Pro (Snow Leopard) with R 3.0.2 installed. It should also work on linux or windows but package installation might differ slightly.
Set some data directories.
Install the necessary bioconductor packages (if not already installed).
source("http://bioconductor.org/biocLite.R") biocLite() biocLite("GEOquery") biocLite("affy") biocLite("gcrma") biocLite("org.Hs.eg.db")
We will use custom CDF packages rather than the standard CDFs. Note, if the packages below do not install successfully with the biocLite method current source files can be installed manually. Go to the latest version, choose the link to ENTREZG for the appropriate version of R, find the HGU133A row and download three source packages ('C', 'P' and 'A'). In a terminal, run R CMD INSTALL pckgfilename where pckgfilename is the name of the package file you downloaded.
biocLite("hgu133ahsentrezgcdf") biocLite("hgu133ahsentrezgprobe") biocLite("hgu133ahsentrezg.db")
Then, load all necessary libraries
library(GEOquery) library(affy) library(gcrma) library(hgu133ahsentrezgcdf) #cdfname="HGU133A_HS_ENTREZG" library(hgu133ahsentrezgprobe) library(hgu133ahsentrezg.db)
Set a data directory, download a GEO dataset, unpack and gunzip, and create a list of files for processing.
setwd(tempdir) getGEOSuppFiles("GSE2034") setwd(paste(tempdir,"GSE2034", sep="")) untar("GSE2034_RAW.tar", exdir="data") cels = list.files("data/", pattern = "CEL") sapply(paste("data", cels, sep="/"), gunzip) cels = list.files("data/", pattern = "CEL")
Create an AffyBatch object and perform GCRMA normalization.
setwd(paste(tempdir,"GSE2034/data", sep="")) raw.data=ReadAffy(verbose=TRUE, filenames=cels, cdfname="HGU133A_HS_ENTREZG") data.gcrma.norm=gcrma(raw.data)
Extract expression values. Then, remove Affy control probes. In this custom CDF (Version 17.1.0, ENTREZG), these are found in rows 12031 to 12098. You can check for yourself and look for probe names starting with "AFFX" using
rownames(gcrma). Alternatively you could use a regex to grab just those rows that don't contain "AFFX"
gcrma=exprs(data.gcrma.norm) #gcrma=gcrma[1:12030,] gcrma=gcrma[which(!grepl("AFFX", rownames(gcrma))),]
Map probe set identifiers to Entrez gene symbols and IDs and then combine with raw data.
probes=row.names(gcrma) symbol = unlist(mget(probes, hgu133ahsentrezgSYMBOL)) ID = unlist(mget(probes, hgu133ahsentrezgENTREZID)) gcrma=cbind(probes,ID,symbol,gcrma)
Write GCRMA-normalized and mapped data to file.
setwd(outdir) write.table(gcrma, file = "trainset_gcrma.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)
Get the clinical details for this dataset.
GSE2034_clindata=getGSEDataTables("GSE2034")[][1:286,] write.table(GSE2034_clindata, "trainset_clindetails.txt", quote = FALSE, sep = "\t", row.names = FALSE)
After running all of the above commands you should have a data file (
trainset_gcrma.txt) with expression estimates for 12,030 genes and 286 breast cancer samples and a second file with clinical details (
trainset_clindetails.txt). To avoid copy-pasting you can download the processAffyTrainData.R script.
IMPORTANT: Obtaining the test data set follows a very similar procedure and can be achieved by downloading and running the processAffyTestData.R script. If successful, the latter will give you a second expression data file (
testset_gcrma.txt) with 189 additional breast cancer samples and another clinical details file (
testset_clindetails.txt). You will need both the
testset___ data files to successfully complete subsequent tutorials.
The next tutorial will show you how to train a randomForest model for classification of relapse versus non-relapse. See Machine learning for cancer classification - part 2 - Building a Random Forest Classifier. NOTE: this tutorial overlaps substantially with this tutorial but was modified here to produce the exact files we will need for subsequent parts of the series.
Are these two gene expression data sets truly independent or do they share some samples?
I believe they are independent (non-overlapping) but you are smart to ask this question. See Figure 1 of my paper for a depiction of this problem.
Very nice figure and great you looked into that!
Thanks for the tutorial. For the CDF file it should be HGU133a instead of HGU95A. Am I right?
Yes. Thanks for spotting that typo. Corrected.
Just a note: If biocLite("hgu133ahsentrezgcdf") fails (I got an error about R 3.0 version) After manually downloading the 3 files the command: R CMD INSTALL has to be used from command line not within R
To install within R, I downloaded the win32 version of the 3 files and then used: install.packages("~/path/hgu133ahsentrezgprobe_17.1.0.zip", repos = NULL)
I didn't know I have use repos=NULL and spent some time to figure it out. Hope it helps other like.
Yes. Thank you for these tips. There are all kinds of issues with installing these packages, specific to linux, windows, mac and depending on version of R and bioconductor. Probably worth a tutorial in itself.
The Dataset (GSE2034) mentioned above contains 22,283 probe Id's, which consists of multiple probes for a single gene or multiple genes for a single probe. After running all the above commands, you got expression estimates for 12,030 genes, that means it contains single expression value (probe) for each 12,030 genes. My questions is how the single expression value for each gene is obtained from the multiples sets of probe/genes. Through the commands I could not understand the process of how the numbers got reduced, and is there any other way to eliminate such multiples and get the single expression value for each gene ?
I am using the custom CDF (from here, see instructions above). The group who produced them has remapped all probes to new probe sets based on Entrez Genes.
Thank you for your nice tutorial, but I want to ask you is it possible to apply this tutorial for an eSet that I have reached from different platform?
I have RNA-Seq data sets. How would I load them in to run Random forest? Would I need to use a completely different approach?
The general principles are the same but this tutorial (part 1) is really focused on preparing a datafile from microarray data. You will want to use an alternative approach to create a file that is analogous to datafile="trainset_gcrma.txt". This would be a simple matrix of gene (or transcript or other feature-level) expression estimates (e.g., TPM or FPKM values) for each sample. You could then start with Machine Learning For Cancer Classification - Part 2 - Building A Random Forest Classifier with relatively minor modifications (e.g., gene filter strategy would be slightly different) to try to train a RF model using that training data with some target classes. I suggest you go through at least parts 1 & 2 as documented here and really try to understand the inputs and outputs of each command. Then think about how your data fits this workflow.
Hi, I am a beginner of Machine Learning. When I started to use your R.script I was stuck at the last two commands.I can not get clinical tables for GSE2034 (trainset_clindetails.txt)! I have tried everything I could (for the whole week!). No problem with testset scripts (GSE2990).
Could you please help to upload the "trainset_clindetails.txt", or send the text file to me by email email@example.com)? That would help me a lot to continuing step 2: Machine Learning For Cancer Classification - Part 2 - Building A Random Forest Classifier.
(I have gotten all 3 files of Part I except train set_clindetails.txt.)
All of the files can be found here: https://github.com/obigriffith/biostar-tutorials/tree/master/MachineLearning. The file you are specifically looking for could be downloaded here: https://raw.githubusercontent.com/obigriffith/biostar-tutorials/master/MachineLearning/trainset_clindetails.txt
Please note that the trainset_gcrma.txt and testset_gcrma.txt can change over time if using different versions of the custom CDF annotations. The clindetails.txt files should be static. I don't know why that command doesn't work. It seems like there are regularly issues with GEO services failing or files being moved. I have had periodic issues with both datasets.
Thank you so much for the tutorial.
I get this error when I run the step to extract clinical details:
909 GSM36894 negative 109 0 ER+ 0 913 GSM36911 negative 80 1 ER+ 0 </internal-data> </data-table> </series>
</miniml> :3: parser error : xmlSAX2StartDocument <miniml ^="" error="" in="" usemethod("xpathapply")="" :="" no="" applicable="" method="" for="" 'xpathapply'="" applied="" to="" an="" object="" of="" class="" "null"="" in="" addition:="" warning="" message:="" in="" xmlroot.xmlinternaldocument(xmlparsedoc(txt))="" :="" empty="" xml="" document<="" p="">
Could you please help.
I try to retrieve the clinical data using the command
I get the following error:
What I have already tried is to use the new function getGSEDataTables() from github. By doing that, I can retrieve the data, but I still get the following error:
Thank you for nice tutorial. I am also suffering from similar error with debitboro.
No encoding supplied: defaulting to UTF-8. Error in UseMethod("xpathApply") : no applicable method for 'xpathApply' applied to an object of class "NULL" In addition: Warning message: In xmlRoot.XMLInternalDocument(xmlParseDoc(txt)) : empty XML document
Would you please help?