2.9 years ago
bhargavdave1 ▴ 10

I have this GEO Illumina dataset, GSE20159, that I need to analyse. It is an Illumina HumanHT-12 V3.0 expression beadchip. I work on Affymetrix-based microarrays, but this is the first time I am using Illumina data. GSE20159 provides GSE20159_non-normalized.txt.gz, which contains only one file, GSE20159_non-normalized.txt. How do I process this file to obtain expression values, and how do I background-correct/normalise the data using neqc?

microarray illumina affy gene expression
2.9 years ago

Edit November 28, 2020:

Further reproducible code: A: GPL6883_HumanRef-8_V3_0_R0_11282963_A (illumina expression beadchip)

## --

Most Illumina 'chip' studies that I have seen on GEO do not contain the raw IDAT files. You can start with the tab-delimited file, but you will also require the annotation file (contained in the *_RAW.tar file) and the usual targets file for limma.

The targets file may look like this:

IDATfile    Group
raw/EX249_001.idat  Peripheral_Blood_cDC1
raw/EX249_005.idat  Peripheral_Blood_cDC2
raw/EX249_007.idat  Peripheral_Blood_early_pre_DC
...


Please always ensure that this is aligned correctly with the data with which you are working. Some of these functions have no internal checks that the targets file is aligned to the expression data, so check it regularly.

# general config
baseDir <- '/home/kblighe/Escritorio/GSE20159/'
bgxfile <- 'Annot/GPL6947_HumanHT-12_V3_0_R1_11283641_A.bgx.gz'
targetsfile <- 'targets.txt'
setwd(baseDir)
options(scipen = 99)
require(limma)

# read in the data and convert the data to an EListRaw object
x <- read.table(
  paste0(baseDir, 'raw/GSE20159_non-normalized.txt'),
  header = TRUE, sep = '\t', stringsAsFactors = FALSE, skip = 0)

# extract detection p-value columns
detectionpvalues <- x[,grep('Detection.Pval', colnames(x))]
x <- x[,-grep('Detection.Pval', colnames(x))]

# set rownames and tidy up final expression matrix
probes <- x$ID_REF
x <- data.matrix(x[,2:ncol(x)])
rownames(x) <- probes
colnames(x) <- gsub('^X', '', gsub('\\.AVG_Signal', '', colnames(x)))

# rename samples manually (optional)
...

# read in annotation and align it with the expression data
annot <- illuminaio::readBGX(bgxfile)$probes
annot <- annot[,which(colnames(annot) %in% c('Source','Symbol','Transcript','ILMN_Gene','RefSeq_ID',
'Entrez_Gene_ID','Symbol','Protein_Product','Probe_Id','Probe_Type',
'Probe_Start','Chromosome','Probe_Chr_Orientation','Probe_Coordinates',
'Cytoband', 'Definition', 'Ontology_Component', 'Ontology_Process',
'Ontology_Function', 'Synonyms'))]
annot <- annot[which(annot$Probe_Id %in% rownames(x)),]
annot <- annot[match(rownames(x), annot$Probe_Id),]

# update the target info
targetinfo <- readTargets(targetsfile, sep = '\t')
rownames(targetinfo) <- gsub('\\.idat$', '', gsub('^raw/', '', targetinfo$IDATfile))
x <- x[,match(rownames(targetinfo), colnames(x))]
if (!all(colnames(x) == rownames(targetinfo)))
stop('Target info is not aligned to expression data - they must be in the same order')

# create a custom EListRaw object
project <- new('EListRaw')
project@.Data[[1]] <- 'illumina'
project@.Data[[2]] <- targetinfo
project@.Data[[3]] <- NULL
project@.Data[[4]] <- x
project@.Data[[5]] <- NULL
project$E <- x
project$targets <- targetinfo
project$genes <- NULL
project$other$Detection <- detectionpvalues

# generate QC plots of raw intensities
dir.create('QC/')
...

# for BeadArrays, background correction and normalisation are handled by a single function: neqc()
project.bgcorrect.norm <- neqc(project, offset = 16)

# filter out control probes, those with no symbol, and those that failed
annot <- annot[which(annot$Probe_Id %in% rownames(project.bgcorrect.norm)),]
project.bgcorrect.norm <- project.bgcorrect.norm[which(rownames(project.bgcorrect.norm) %in% annot$Probe_Id),]
annot <- annot[match(rownames(project.bgcorrect.norm), annot$Probe_Id),]
project.bgcorrect.norm@.Data[[3]] <- annot
project.bgcorrect.norm$genes <- annot

Control <- project.bgcorrect.norm$genes$Source == "ILMN_Controls"
NoSymbol <- project.bgcorrect.norm$genes$Symbol == ""
isexpr <- rowSums(project.bgcorrect.norm$other$Detection <= 0.05) >= 3
project.bgcorrect.norm.filt <- project.bgcorrect.norm[!Control & !NoSymbol & isexpr, ]
dim(project.bgcorrect.norm)
dim(project.bgcorrect.norm.filt)

# remove annotation columns we no longer need
project.bgcorrect.norm.filt$genes <- project.bgcorrect.norm.filt$genes[,c(
  'Probe_Id',
  'Definition', 'Ontology_Component', 'Ontology_Process', 'Ontology_Function',
  'Chromosome', 'Probe_Coordinates', 'Cytoband', 'Probe_Chr_Orientation',
  'RefSeq_ID', 'Entrez_Gene_ID', 'Symbol')]
head(project.bgcorrect.norm.filt$genes)

# summarise across genes by mean - use ID
project.bgcorrect.norm.filt.mean <- avereps(project.bgcorrect.norm.filt,
ID = project.bgcorrect.norm.filt$genes$Symbol)


Hello Kevin,

I could not find a targets file for my dataset (GSE48072) on GEO, nor could I locate one for GSE20159 mentioned here. Have I overlooked something?

Regarding filtering probes based on Detection P value:

So far, I have used the lumi package for the analysis. I have tried using the 'all_vars' and 'any_vars' functions, which gave me two extreme outputs (all_vars = 10k, any_vars = 37k probes at P < 0.05, of the total 47k probes). I was thinking of filtering out probes based on a condition like the one you used here (>= 3). However, I would like to know: why >= 3? Is it widely used? Sorry for my naive question, but I am not aware of it.

Regarding background correction:

Also, is it necessary to perform background correction for BeadArray data? I was under the impression that the non-normalized data deposited on GEO for BeadArrays are already background corrected. For now, I have managed to perform quantile normalisation using the lumiN function, which does not perform background correction. I am really not sure how I should go about this. I also tried using neqc on my expression set, but I get the following error:

# GSE48072 is the LumiBatch object
neqc(GSE48072, offset = 16)
Error in normexp.fit.detection.p(x, detection.p) :
  detection.p must be a numeric matrix (unless x is an EListRaw)

It would be really helpful if you can guide me.

Thanks


> Hello Kevin,

Hello

> I could not find a targets file for my dataset (GSE48072) on GEO, nor could I locate one for GSE20159 mentioned here. Have I overlooked something?

You will have to create it.

> However, I would like to know: why >= 3? Is it widely used?

I use it because it is the default in the limma manual's section on processing Illumina arrays. You can use any reasonable value.
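To see how sensitive this cut-off is in practice, you can tabulate, per probe, the number of samples in which it is "detected". A minimal sketch, assuming detectionpvalues is the probes-by-samples detection p-value matrix extracted in the code above:

```r
# number of samples in which each probe is detected at p < 0.05
ndetected <- rowSums(detectionpvalues < 0.05)

# distribution of detection counts across all probes
table(ndetected)

# compare how many probes survive at a few cut-offs
sum(ndetected >= 1)                        # detected in any sample (lenient)
sum(ndetected >= 3)                        # cut-off used in the limma manual example
sum(ndetected == ncol(detectionpvalues))   # detected in all samples (strict)
```

The gap between the 'any' and 'all' counts here corresponds to the 37k-vs-10k spread you saw with any_vars/all_vars; an intermediate cut-off such as >= 3 sits between those extremes.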

> Also, is it necessary to perform background correction for BeadArray data? I was under the impression that the non-normalized data deposited on GEO for BeadArrays are already background corrected.

The data that are deposited on GEO may not be what you think. You should check the author-provided details by clicking on each GSM ID from the main GSE accession page to see if there is any extra information on how they pre-processed their data and which data they deposited on GEO.

> But I tried using neqc on my expression set, but I get the following error:

Assuming that you follow my exact code, which I would regard as being for [very] advanced users only, it should work. What is the output of str(detectionpvalues)? Obviously, it should have the exact same dimensions as the expression data.
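A quick way to run that check is the sketch below, assuming detectionpvalues and x are the detection p-value and expression objects from the code above (the error message itself says neqc wants a numeric matrix when the input is not an EListRaw):

```r
# inspect the object that neqc will receive as detection.p
str(detectionpvalues)

# neqc requires a numeric matrix, not a data frame
detectionpvalues <- data.matrix(detectionpvalues)

# it must have exactly the same dimensions as the expression matrix
stopifnot(
  is.numeric(detectionpvalues),
  nrow(detectionpvalues) == nrow(x),
  ncol(detectionpvalues) == ncol(x))
```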


I posted further reproducible code here, by the way:


Thanks, Kevin.

This clears most of my doubts.

About neqc(): No, I haven't followed your exact code. I just thought neqc might come to the rescue if these values aren't background-corrected. I did check the GSM IDs on GEO, but they don't describe the entire preprocessing for the samples, and they don't make clear whether the non-normalized data is background-corrected or not. Maybe I will read more about it.

Thanks for the response.


Yes, I have no idea why GEO does not request that the IDAT files be uploaded, as it requests CEL files for Affymetrix and TXT files for Agilent.

You could check the distribution of the data via box-and-whisker plots in order to determine whether or not it may have already been normalised.
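A minimal sketch of that check, assuming x is the raw intensity matrix read from the non-normalized file as in the code above:

```r
# raw intensities are heavily right-skewed, so plot on the log2 scale;
# one box per sample (column)
boxplot(log2(x), las = 2, outline = FALSE,
        ylab = 'log2 intensity',
        main = 'Per-sample intensity distributions')

# near-identical medians and box widths across all samples suggest the
# data have already been normalised; ragged, uneven boxes suggest raw data
```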


Sure I will do that.

Thank you once again.


Hi Kevin, can you please suggest, from your knowledge, any way to check (from plots or by any other means) whether Illumina raw data (.txt) submitted to GEO has already been background corrected or not? I mean, how can I decide whether I need to do it from whatever we have? Thanks very much.


If you click on one of the links for a GSM ID, it should show the methods that were applied. If you want advice for a particular GEO study, you could let me know the GSE ID. Unfortunately, for the Illumina studies, it is never straightforward...

2.6 years ago

Hi, I hope my comment can help you. I met the same problem as you, so I read limma's user's guide and found a way to deal with the data.

#background correction and normalisation

library(limma)
y <- neqc(x, offset = 16, detection.p = "Detection Pval")

# Filter out probes that are not expressed. We keep probes that are expressed
# in at least three arrays, according to a detection p-value of 5%
expressed <- rowSums(y$other$Detection < 0.05) >= 3
y <- y[expressed,]
dim(y)

#output
expdata <- y$E
write.csv(expdata, file = "F:/mydata/b.csv", quote = TRUE, row.names = TRUE)


I use Excel to proofread the ProbeIDs.
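As an alternative to proofreading probe IDs in Excel, the matching can be done in R before writing the CSV. A sketch, assuming y is the neqc output from above and annot is an annotation data frame with a Probe_Id column (as built from the BGX file in the earlier answer):

```r
# align annotation rows to the expression rows by probe ID
idx <- match(rownames(y), annot$Probe_Id)

# any NA here means a probe ID with no annotation match - worth inspecting
sum(is.na(idx))

# write annotation and expression side by side, in matching row order
expdata <- data.frame(annot[idx, c('Probe_Id', 'Symbol')], y$E,
                      check.names = FALSE)
write.csv(expdata, file = "b.csv", quote = TRUE, row.names = FALSE)
```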

I am not an expert in this field. If my code has something wrong, I hope others can tell me.