Question: Illumina HumanHT-12 V3.0 expression beadchip reading data
bhargavdave1 wrote:

I have GEO Illumina data, GSE20159, that I need to analyze. This is an Illumina HumanHT-12 V3.0 expression beadchip. I work on Affymetrix-based microarrays, but this is the first time I am using Illumina data. GSE20159 provides GSE20159_non-normalized.txt.gz, which contains only a single file, GSE20159_non-normalized.txt. How do I process this file to get expression values, and how do I background correct / normalise this data using neqc?

Kevin Blighe wrote:

Edit November 28, 2020:

Further reproducible code: A: GPL6883_HumanRef-8_V3_0_R0_11282963_A (illumina expression beadchip)

--

Most Illumina 'chip' studies that I have seen on GEO do not contain the raw IDAT files. You can start with the tab-delimited file, but you will also require the annotation file (contained in the *_RAW.tar file) and the usual targets file for limma.
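
If you prefer to fetch the supplementary files from within R rather than from the website, GEOquery can do that; a minimal sketch:

  # hedged sketch: download GSE20159_non-normalized.txt.gz, *_RAW.tar, etc. into ./GSE20159/
  library(GEOquery)
  getGEOSuppFiles('GSE20159')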

The targets file may look like this:

IDATfile    Group
raw/EX249_001.idat  Peripheral_Blood_cDC1
raw/EX249_005.idat  Peripheral_Blood_cDC2
raw/EX249_007.idat  Peripheral_Blood_early_pre_DC
...

Please always ensure that this is aligned correctly with the data with which you are working: some of these functions have no internal checks that the targets file is aligned to the expression data, so check it regularly.

# general config
  baseDir <- '/home/kblighe/Escritorio/GSE20159/'
  bgxfile <- 'Annot/GPL6947_HumanHT-12_V3_0_R1_11283641_A.bgx.gz'
  targetsfile <- 'targets.txt'
  setwd(baseDir)
  options(scipen = 99)
  require(limma)


# read in the data and convert the data to an EListRaw object
  x <- read.table(
    paste0(baseDir, 'raw/GSE20159_non-normalized.txt'),
    header = TRUE, sep = '\t', stringsAsFactors = FALSE, skip = 0)

  # extract the detection p-value columns
  detectionpvalues <- x[,grep('Detection.Pval', colnames(x))]
  x <- x[,-grep('Detection.Pval', colnames(x))]

  # set rownames and tidy up the final expression matrix
  probes <- x$ID_REF
  x <- data.matrix(x[,2:ncol(x)])
  rownames(x) <- probes
  colnames(x) <- gsub('^X', '',
    gsub('\\.AVG_Signal', '', colnames(x)))


# rename samples manually (optional)
  ...


# read in annotation and align it with the expression data
  annot <- illuminaio::readBGX(bgxfile)$probes
  annot <- annot[,which(colnames(annot) %in% c('Source','Symbol','Transcript','ILMN_Gene','RefSeq_ID',
    'Entrez_Gene_ID','Protein_Product','Probe_Id','Probe_Type',
    'Probe_Start','Chromosome','Probe_Chr_Orientation','Probe_Coordinates',
    'Cytoband', 'Definition', 'Ontology_Component', 'Ontology_Process',
    'Ontology_Function', 'Synonyms'))]
  annot <- annot[which(annot$Probe_Id %in% rownames(x)),]
  annot <- annot[match(rownames(x), annot$Probe_Id),]


# update the target info
  targetinfo <- readTargets(targetsfile, sep = '\t')
  rownames(targetinfo) <- gsub('\\.idat$', '',
    gsub('^raw/', '', targetinfo$IDATfile))
  x <- x[,match(rownames(targetinfo), colnames(x))]
  if (!all(colnames(x) == rownames(targetinfo)))
    stop('Target info is not aligned to expression data - they must be in the same order')


# create a custom EListRaw object to hold the expression matrix, targets, and detection p-values
  project <- new('EListRaw')
  project@.Data[[1]] <- 'illumina'
  project@.Data[[2]] <- targetinfo
  project@.Data[[3]] <- NULL
  project@.Data[[4]] <- x
  project@.Data[[5]] <- NULL
  project$E <- x
  project$targets <- targetinfo
  project$genes <- NULL
  project$other$Detection <- detectionpvalues


# generate QC plots of raw intensities
  dir.create('QC/')
  ...


# for BeadArrays, background correction and normalisation are handled by a single
# function, neqc(): normexp background correction (fitted from the detection
# p-values when no control probes are available), quantile normalisation, and log2 transformation
  project.bgcorrect.norm <- neqc(project, offset = 16)


# filter out control probes, those with no symbol, and those that failed
  annot <- annot[which(annot$Probe_Id %in% rownames(project.bgcorrect.norm)),]
  project.bgcorrect.norm <- project.bgcorrect.norm[which(rownames(project.bgcorrect.norm) %in% annot$Probe_Id),]
  annot <- annot[match(rownames(project.bgcorrect.norm), annot$Probe_Id),]
  project.bgcorrect.norm@.Data[[3]] <- annot
  project.bgcorrect.norm$genes <- annot

  Control <- project.bgcorrect.norm$genes$Source=="ILMN_Controls"
  NoSymbol <- project.bgcorrect.norm$genes$Symbol == ""
  isexpr <- rowSums(project.bgcorrect.norm$other$Detection <= 0.05) >= 3

  project.bgcorrect.norm.filt <- project.bgcorrect.norm[!Control & !NoSymbol & isexpr, ]
  dim(project.bgcorrect.norm)
  dim(project.bgcorrect.norm.filt)


# remove annotation columns we no longer need
  project.bgcorrect.norm.filt$genes <- project.bgcorrect.norm.filt$genes[,c(
    'Probe_Id',
    'Definition','Ontology_Component','Ontology_Process','Ontology_Function',
    'Chromosome','Probe_Coordinates','Cytoband','Probe_Chr_Orientation',
    'RefSeq_ID','Entrez_Gene_ID','Symbol')]
  head(project.bgcorrect.norm.filt$genes)

# summarise across genes by mean - use ID
  project.bgcorrect.norm.filt.mean <- avereps(project.bgcorrect.norm.filt,
    ID = project.bgcorrect.norm.filt$genes$Symbol)

vipulwagh311 wrote:

Hello Kevin,

I could not find a targets file for my dataset (GSE48072) on GEO, nor could I locate one for GSE20159 mentioned here. Have I overlooked something?

Regarding filtering probes based on Detection P value:

So far, I have used the lumi package for the analysis. I have tried using the 'all_vars' and 'any_vars' functions, which gave me two extreme outputs (all_vars = 10k, any_vars = 37k probes at P < 0.05, of the total 47k probes). I was thinking of filtering out probes based on some condition like the one you have used here (>= 3). However, I would like to know: why >= 3? Is it widely used? Sorry for my naive question, but I am not aware of it.

Regarding background correction:

Also, is it necessary to perform background correction for BeadArrays? I was under the impression that the non-normalized data deposited on GEO for BeadArrays is already background corrected. For now, I have managed to perform quantile normalisation using the lumiN function, which does not perform background correction. I am really not sure how I should go about this, but when I tried using neqc on my expression set, I got the following error:

  # GSE48072 is the LumiBatch object
  neqc(GSE48072, offset = 16)
  # Error in normexp.fit.detection.p(x, detection.p) :
  #   detection.p must be a numeric matrix (unless x is an EListRaw)

It would be really helpful if you can guide me.

Thanks


Kevin Blighe wrote:

Hello

> I could not find a targets file for my dataset (GSE48072) on GEO, nor could I locate one for GSE20159 mentioned here. Have I overlooked something?

You will have to create it.

> However, I would like to know: why >= 3? Is it widely used?

I use it because it is the default in the limma manual's section on processing Illumina arrays. You can use any reasonable value.
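
One reasonable alternative is to tie the cut-off to the size of the smallest experimental group; a sketch against the code in my answer above (assuming a Group column in the targets file, as in my example):

  # hedged sketch: require detection (p <= 0.05) in at least as many samples as the smallest group
  min.group <- min(table(targetinfo$Group))
  isexpr <- rowSums(project.bgcorrect.norm$other$Detection <= 0.05) >= min.group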

> Also, is it necessary to perform background correction for BeadArrays? I was under the impression that non-normalized data deposited on GEO for BeadArrays is already background corrected.

The data that is deposited on GEO may not be what you think. You should check the author-provided details by clicking on each GSM ID from the main GSE accession page and see if there is any extra information on how they pre-processed their data, and which data they deposited on GEO.
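
If you would rather pull those submitter notes programmatically than click through each GSM, GEOquery exposes them; a small sketch (data_processing is a standard field in GEO series matrices, though its content varies by submitter):

  # hedged sketch: read the submitters' own description of their pre-processing
  library(GEOquery)
  gse <- getGEO('GSE48072')[[1]]
  unique(pData(gse)$data_processing)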

> But when I tried using neqc on my expression set, I got the following error:

Assuming that you follow my exact code, which I would regard as being for [very] advanced users only, it should work. What is the output of str(detectionpvalues)? Obviously, it should have the exact same dimensions as the expression data.
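
If you want to stay with your LumiBatch, note that neqc() also accepts a plain matrix of intensities together with a numeric matrix of detection p-values; a sketch, assuming GSE48072 carries its detection p-values in the detection slot:

  # hedged sketch: call neqc() on matrices extracted from a lumi LumiBatch
  library(lumi)
  library(limma)
  rawint <- exprs(GSE48072)      # raw intensities
  detp <- detection(GSE48072)    # detection p-values; must match rawint's dimensions
  y <- neqc(rawint, detection.p = detp, offset = 16)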


Kevin Blighe wrote:

I posted further reproducible code here, by the way: A: GPL6883_HumanRef-8_V3_0_R0_11282963_A (illumina expression beadchip)


vipulwagh311 wrote:

Thanks, Kevin.

This clears most of my doubts.

About neqc(): no, I haven't followed your exact code. I just thought neqc might come to the rescue if these values aren't background-corrected. I did check the GSM IDs on GEO, but that only describes the overall preprocessing for the samples, and it doesn't make clear whether the non-normalized data is background-corrected or not. Maybe I will read more about it.

Thanks for the response.


Kevin Blighe wrote:

Yes, I have no idea why GEO does not request the IDAT files to be uploaded, like it requests CEL files for Affymetrix and TXT files for Agilent.

You could check the distribution of the data via box-and-whiskers in order to determine whether or not it may have already been normalised.
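
Something quick along these lines would do (a sketch; x is the matrix read from the non-normalized file):

  # hedged sketch: normalised data typically shows near-identical distributions across samples
  boxplot(as.data.frame(log2(x)), las = 2, outline = FALSE, ylab = 'log2 intensity')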


vipulwagh311 wrote:

Sure, I will do that.

Thank you once again.

zhijianzheng007 wrote:

Hi, I hope my comment can help you. I ran into the same question as you, read limma's user's guide, and found a way to deal with the data:

#background correction and normalisation

library(limma)
x <- read.ilmn(files = "your-raw-data.txt", expr = ".AVG_Signal",
  probeid = "ID_REF", other.columns = "Detection Pval")
y <- neqc(x, offset = 16, detection.p = "Detection Pval")

#Filter out probes that are not expressed. We keep probes that are expressed
#in at least three arrays, according to a detection p-value of 5%.
#Note: read.ilmn stores the detection p-values under the name given to other.columns.
expressed <- rowSums(y$other$`Detection Pval` < 0.05) >= 3
y <- y[expressed,]
dim(y)

#output
expdata <- y$E
write.csv(expdata, file = "F:/mydata/b.csv", quote = TRUE, row.names = TRUE)

I use Excel to proofread the ProbeID column.
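
As an alternative to Excel, the probe IDs can be matched to gene symbols in R before exporting; a sketch, assuming the platform's BGX annotation file (named here as on the GPL6947 record) has been downloaded:

# hedged sketch: attach gene symbols from the BGX annotation before writing the CSV
annot <- illuminaio::readBGX('GPL6947_HumanHT-12_V3_0_R1_11283641_A.bgx.gz')$probes
expdata <- data.frame(
  Symbol = annot$Symbol[match(rownames(y$E), annot$Probe_Id)],
  y$E, check.names = FALSE)
write.csv(expdata, file = "F:/mydata/b_annotated.csv", quote = TRUE, row.names = TRUE)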

I am not an expert in this field. If my code has something wrong, I hope others can tell me.
