Sample code for GSE79404
#!/usr/bin/Rscript
# general config
baseDir <- 'GSE79404/'
bgxfile <- 'Annot/HumanHT-12_V4_0_R2_15002873_B.bgx'
targetsfile <- 'targets.txt'
setwd(baseDir)
options(scipen = 99)
require(limma)
require(RColorBrewer)
require(PCAtools)
# read in the data and convert the data to an EListRaw object, which is a data object for single channel data
x <- read.table(paste0(baseDir, 'raw/GSE79404_non-normalized.txt'),
header = TRUE, sep = '\t', stringsAsFactors = FALSE, skip = 4)
newnames <- colnames(read.table(paste0(baseDir, 'raw/GSE79404_non-normalized.txt'),
header = TRUE, sep = '\t', stringsAsFactors = FALSE, skip = 3))
detectionpvalues <- x[,grep('Detection.Pval', colnames(x))]
x <- x[,-grep('Detection.Pval', colnames(x))]
probes <- x$ID_REF
x <- data.matrix(x[,2:ncol(x)])
rownames(x) <- probes
colnames(x) <- gsub('^X', '', colnames(x))
colnames(x) <- newnames[grep('GSM', newnames)]
# read in annotation
annot <- illuminaio::readBGX(bgxfile)$probes
annot <- annot[,which(colnames(annot) %in% c('Source','Symbol','Transcript','ILMN_Gene','RefSeq_ID',
'Entrez_Gene_ID','Symbol','Protein_Product','Probe_Id','Probe_Type',
'Probe_Start','Chromosome','Probe_Chr_Orientation','Probe_Coordinates',
'Cytoband', 'Definition', 'Ontology_Component', 'Ontology_Process',
'Ontology_Function', 'Synonyms'))]
annot <- annot[which(annot$Probe_Id %in% rownames(x)),]
annot <- annot[match(rownames(x), annot$Probe_Id),]
# update the target info
targetinfo <- readTargets(targetsfile, sep = '\t')
rownames(targetinfo) <- targetinfo$IDATfile
x <- x[,match(rownames(targetinfo), colnames(x))]
if (!all(colnames(x) == rownames(targetinfo)))
stop('Target info is not aligned to expression data - they must be in the same order')
# create a custom EListRaw object
project <- new('EListRaw')
project@.Data[[1]] <- 'illumina'
project@.Data[[2]] <- targetinfo
#project@.Data[[3]] <- annot
project@.Data[[3]] <- NULL
project@.Data[[4]] <- x
project@.Data[[5]] <- NULL
project$E <- x
project$targets <- targetinfo
#project$genes <- annot
project$genes <- NULL
project$other$Detection <- detectionpvalues
# generate QC plots of raw intensities
... ...
# for BeadArrays, background correction and normalisation are handled by a single function: neqc()
# this is the same as per Agilent single colour arrays
#
# perform background correction on the fluorescent intensities
# 'normexp' is beneficial in that it doesn't result in negative values, meaning no data is lost
# for ideal offset, see Gordon Smyth's answer, here: https://stat.ethz.ch/pipermail/bioconductor/2006-April/012554.html
# normalize the data with the 'quantile' method, to be consistent with RMA for Affymetrix arrays
project.bgcorrect.norm <- neqc(project, offset = 16)
# filter out control probes, those with no symbol, and those that failed
annot <- annot[which(annot$Probe_Id %in% rownames(project.bgcorrect.norm)),]
project.bgcorrect.norm <- project.bgcorrect.norm[which(rownames(project.bgcorrect.norm) %in% annot$Probe_Id),]
annot <- annot[match(rownames(project.bgcorrect.norm), annot$Probe_Id),]
project.bgcorrect.norm@.Data[[3]] <- annot
project.bgcorrect.norm$genes <- annot
Control <- project.bgcorrect.norm$genes$Source=="ILMN_Controls"
NoSymbol <- project.bgcorrect.norm$genes$Symbol == ""
isexpr <- rowSums(project.bgcorrect.norm$other$Detection <= 0.05) >= 3
project.bgcorrect.norm.filt <- project.bgcorrect.norm[!Control & !NoSymbol & isexpr, ]
dim(project.bgcorrect.norm)
dim(project.bgcorrect.norm.filt)
# remove annotation columns we no longer need
project.bgcorrect.norm.filt$genes <- project.bgcorrect.norm.filt$genes[,c(
'Probe_Id',
'Definition','Ontology_Component','Ontology_Process','Ontology_Function',
'Chromosome','Probe_Coordinates','Cytoband','Probe_Chr_Orientation',
'RefSeq_ID','Entrez_Gene_ID','Symbol')]
head(project.bgcorrect.norm.filt$genes)
# summarise across genes by mean
# ID is used to identify the replicates
project.bgcorrect.norm.filt.mean <- avereps(project.bgcorrect.norm.filt,
ID = project.bgcorrect.norm.filt$genes$Symbol)
dim(project.bgcorrect.norm.filt.mean)
targets.txt should look like:
IDATfile Group
GSM2094831 Healthy
GSM2094832 Healthy
GSM2094833 Disease
GSM2094834 Healthy
... ...
Hi Kevin! How are you ?
Using your responses, I did the pre process from raw data! (thank you very much for this!!) Now I'm going to proceed with differential expression analysis. Can I use the limma package for this tipe of data ? One more question, why i need summarise across genes by mean?
Olá! Sim / yes, you can use limma in the standard way.
You can use mean or median (or something else). Mean is okay in this scenario because we already filtered out 'bad' probes just after normalisation via
neqc()
. In other situations, mean can be affected by outlier values.Ok, in this case, can I use "project.bgcorrect.norm.filt.mean$E" or I have to extract the mean in another way ?
Yes,
project.bgcorrect.norm.filt.mean$E
is the expression matrix on which you can use limma to perfrom differential expression analysis.Hey Kevin! When I try proceed with DEA analysis ,I'm having a litle troubles. Can you help-me ?
It means that your variable names cannot have spaces or other characters like hyphens, back or forward slashes, etc.
Bad:
Good:
My Variables already have:
What is the output of:
?
I think I did something wrong...
I fixed the mistakes... now, when I run:
I recieve:
I returned on GEO2R, and, for this serie, aparently theres no Diferencially expression bethwen groups. Ps. I wanna compare Non-habits whith habits.
Hi kevin!For getting reliable expression matrices, I wanna know how to process the two non-normalized data from GEO database(GSE98278 & GSE47472). Here are the colnames of the two non-normalized data.Could someone share the codes?This will be a great help for me.
Hello Kevin, I am pretty new to R and very unfamiliar with Illumina data. This script is helping a lot, I am working on GSE38481 and GSE38484, however, to understand this workflow I downloaded the dataset you mentioned and am running the pipeline. But when I do so I cannot seem to locate a bgx or targets file. How do I get those?
Hi, the bgx file is just an annotation file - you technically don't need it if you can create your own annotation file via another means, like Bioconductor annotation packages.
This code should create a master table for the Illumina Human v4
Other annotation packages are listed here: https://www.bioconductor.org/packages/release/BiocViews.html#___AnnotationData
Going this way involves even more customisation, though, and increased risk.
Regarding the targets file, that is for you to create manually - it's basically the metadata file.
Oh alright! Thanks!