Tutorial: Machine Learning For Cancer Classification - Part 1 - Preparing The Data Sets
54
gravatar for Obi Griffith
3.2 years ago by
Obi Griffith14k
Washington University, St Louis, USA
Obi Griffith14k wrote:

This tutorial is part of a series illustrating basic concepts and techniques for machine learning in R. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a recent paper.

The first step is to prepare the data sets. We will use GSE2034 as a training data set and GSE2990 as a test data set. These are both data sets making use of the Affymetrix U133A platform (GPL96). First, let's download and pre-process the training data. Note: The gcrma step may require you to have as much as ~8gb of ram. I ran this tutorial on a Mac Pro (Snow Leopard) with R 3.0.2 installed. It should also work on linux or windows but package installation might differ slightly.

Set some data directories.

tempdir="/Users/ogriffit/temp/"
outdir="/Users/ogriffit/git/biostar-tutorials/MachineLearning"

Install the necessary bioconductor packages (if not already installed).

source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("GEOquery")
biocLite("affy")
biocLite("gcrma")
biocLite("org.Hs.eg.db")

We will use custom CDF packages rather than the standard CDFs. Note, if the packages below do not install successfully with the biocLite method current source files can be installed manually. Go to the latest version, choose the link to ENTREZG for the appropriate version of R, find the HGU133A row and download three source packages ('C', 'P' and 'A'). In a terminal, run R CMD INSTALL pckgfilename where pckgfilename is the name of the package file you downloaded.

biocLite("hgu133ahsentrezgcdf")
biocLite("hgu133ahsentrezgprobe")
biocLite("hgu133ahsentrezg.db")

Then, load all necessary libraries

library(GEOquery)
library(affy)
library(gcrma)
library(hgu133ahsentrezgcdf) #cdfname="HGU133A_HS_ENTREZG"
library(hgu133ahsentrezgprobe)
library(hgu133ahsentrezg.db)

Set a data directory, download a GEO dataset, unpack and gunzip, and create a list of files for processing.

setwd(tempdir)
getGEOSuppFiles("GSE2034")
setwd(paste(tempdir,"GSE2034", sep=""))
untar("GSE2034_RAW.tar", exdir="data")
cels = list.files("data/", pattern = "CEL")
sapply(paste("data", cels, sep="/"), gunzip)
cels = list.files("data/", pattern = "CEL")

Create an AffyBatch object and perform GCRMA normalization.

setwd(paste(tempdir,"GSE2034/data", sep=""))
raw.data=ReadAffy(verbose=TRUE, filenames=cels, cdfname="HGU133A_HS_ENTREZG") 
data.gcrma.norm=gcrma(raw.data)

Extract expression values. Then, remove Affy control probes. In this custom CDF (Version 17.1.0, ENTREZG), these are found in rows 12031 to 12098. You can check for yourself and look for probe names starting with "AFFX" using 'rownames(gcrma)'. Alternatively you could use a regex to grab just those rows that don't contain "AFFX"

gcrma=exprs(data.gcrma.norm)
#gcrma=gcrma[1:12030,]
gcrma=gcrma[which(!grepl("AFFX", rownames(gcrma))),]

Map probe set identifiers to Entrez gene symbols and IDs and then combine with raw data.

probes=row.names(gcrma)
symbol = unlist(mget(probes, hgu133ahsentrezgSYMBOL))
ID = unlist(mget(probes, hgu133ahsentrezgENTREZID))
gcrma=cbind(probes,ID,symbol,gcrma)

Write GCRMA-normalized and mapped data to file.

setwd(outdir)
write.table(gcrma, file = "trainset_gcrma.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)

Get the clinical details for this dataset.

GSE2034_clindata=getGSEDataTables("GSE2034")[[2]][1:286,]
write.table(GSE2034_clindata, "trainset_clindetails.txt", quote = FALSE, sep = "\t", row.names = FALSE)

After running all of the above commands you should have a data file (trainset_gcrma.txt) with expression estimates for 12,030 genes and 286 breast cancer samples and a second file with clinical details (trainset_clindetails.txt). To avoid copy-pasting you can download the processAffyTrainData.R script.

IMPORTANT: Obtaining the test data set follows a very similar procedure and can be achieved by downloading and running the processAffyTestData.R script. If successful, the latter will give you a second expression data file (testset_gcrma.txt) with 189 additional breast cancer samples and another clinical details file (testset_clindetails.txt). You will need both the trainset___ and testset___ data files to successfully complete subsequent tutorials.

The next tutorial will show you how to train a randomForest model for classification of relapse versus non-relapse. See Machine learning for cancer classification - part 2 - Building a Random Forest Classifier. NOTE: this tutorial overlaps substantially with this tutorial but was modified here to produce the exact files we will need for subsequent parts of the series.

ADD COMMENTlink modified 4 weeks ago by yuanjun.ma0 • written 3.2 years ago by Obi Griffith14k
1

Are these two gene expression data sets truly independent or do they share some samples?

ADD REPLYlink written 3.2 years ago by Christian2.4k

I believe they are independent (non-overlapping) but you are smart to ask this question. See Figure 1 of my paper for a depiction of this problem.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Obi Griffith14k

Very nice figure and great you looked into that!

ADD REPLYlink written 3.2 years ago by Christian2.4k

Thanks for the tutorial. For the CDF file it should be HGU133a instead of HGU95A. Am I right?

ADD REPLYlink written 3.1 years ago by Diwan320

Yes. Thanks for spotting that typo. Corrected.

ADD REPLYlink written 3.1 years ago by Obi Griffith14k

Just a note: If biocLite("hgu133ahsentrezgcdf") fails (I got an error about R 3.0 version) After manually downloading the 3 files the command: R CMD INSTALL has to be used from command line not within R

To install within R, I downloaded the win32 version of the 3 files and then used: install.packages("~/path/hgu133ahsentrezgprobe_17.1.0.zip", repos = NULL)

I didn't know I have use repos=NULL and spent some time to figure it out. Hope it helps other like.

ADD REPLYlink written 3.1 years ago by Diwan320

Yes. Thank you for these tips. There are all kinds of issues with installing these packages, specific to linux, windows, mac and depending on version of R and bioconductor. Probably worth a tutorial in itself.

ADD REPLYlink written 3.1 years ago by Obi Griffith14k

The Dataset (GSE2034) mentioned above contains 22,283 probe Id's, which consists of multiple probes for a single gene or multiple genes for a single probe. After running all the above commands, you got expression estimates for 12,030 genes, that means it contains single expression value (probe) for each 12,030 genes. My questions is how the single expression value for each gene is obtained from the multiples sets of probe/genes. Through the commands I could not understand the process of how the numbers got reduced, and is there any other way to eliminate such multiples and get the single expression value for each gene ?

ADD REPLYlink written 18 months ago by swapnil900

I am using the custom CDF (from here, see instructions above). The group who produced them has remapped all probes to new probe sets based on Entrez Genes.

ADD REPLYlink written 18 months ago by Obi Griffith14k

HI ,

Tank you for your nice tutorial, but I want to ask you is it possible to apply this tutorial for an eSet,that I have reached from different platform?

 

ADD REPLYlink written 17 months ago by Shamim Sarhadi170

Thank you so much for the tutorial.

I get this error when I run the step to extract clinical details:

GSE2034_clindata=getGSEDataTables("GSE2034")[[2]][1:286,]

909 GSM36894 negative 109 0 ER+ 0 913 GSM36911 negative 80 1 ER+ 0 </internal-data> </data-table> </series>

</miniml> :3: parser error : xmlSAX2StartDocument <miniml ^="" error="" in="" usemethod("xpathapply")="" :="" no="" applicable="" method="" for="" 'xpathapply'="" applied="" to="" an="" object="" of="" class="" "null"="" in="" addition:="" warning="" message:="" in="" xmlroot.xmlinternaldocument(xmlparsedoc(txt))="" :="" empty="" xml="" document<="" p="">

Could you please help.

Thanks Sharvari

ADD REPLYlink written 4 weeks ago by sgujja20
0
gravatar for helen.smith-2
9 months ago by
helen.smith-20 wrote:

Hi Obi,

I have RNA-Seq data sets. How would I load them in to run Random forest? Would I need to use a completely different approach?

Many thanks

ADD COMMENTlink written 9 months ago by helen.smith-20
0
gravatar for Obi Griffith
9 months ago by
Obi Griffith14k
Washington University, St Louis, USA
Obi Griffith14k wrote:

The general principles are the same but this tutorial (part 1) is really focused on preparing a datafile from microarray data. You will want to use an alternative approach to create a file that is analogous to datafile="trainset_gcrma.txt". This would be a simple matrix of gene (or transcript or other feature-level) expression estimates (e.g., FPKM values) for each sample. You could then start with part 2 of this tutorial to try to train a RF model using that training data with some target classes. I suggest you go through at least parts 1 & 2 as documented here and really try to understand the inputs and outputs of each command. Then think about how your data fits this workflow.

ADD COMMENTlink written 9 months ago by Obi Griffith14k
0
gravatar for yuanjun.ma
4 weeks ago by
yuanjun.ma0
yuanjun.ma0 wrote:

Hi, I am a beginner of Machine Learning. When I started to use your R.script I was stuck at the last two commands.I can not get clinical tables for GSE2034 (trainset_clindetails.txt)! I have tried everything I could (for the whole week!). No problem with testset scripts (GSE2990).

GSE2034_clindata=getGSEDataTables("GSE2034")[[2]][1:286,]

Could you please help to upload the "trainset_clindetails.txt", or send the text file to me by email yuanjun.ma@ki.se)? That would help me a lot to continuing step 2: Machine Learning For Cancer Classification - Part 2 - Building A Random Forest Classifier.

(I have gotten all 3 files of Part I except train set_clindetails.txt.)

Merry Chirstmas!

ADD COMMENTlink written 4 weeks ago by yuanjun.ma0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1741 users visited in the last hour