Question

How to process (seems) Agilent microarrry data?

6

Entering edit mode

4.8 years ago

MatthewP ★ 1.4k

Hello, this is my first time to process microarray data, I don't even know what my data means. The data has too many columns to post here, I just show all colnames:

> colnames(df)
 [1] "FEATURES"                       "FeatureNum"                    
 [3] "Row"                            "Col"                           
 [5] "chr_coord"                      "SubTypeMask"                   
 [7] "SubTypeName"                    "Start"                         
 [9] "Sequence"                       "ProbeUID"                      
[11] "ControlType"                    "ProbeName"                     
[13] "GeneName"                       "SystematicName"                
[15] "Description"                    "PositionX"                     
[17] "PositionY"                      "gSurrogateUsed"                
[19] "gIsFound"                       "gProcessedSignal"              
[21] "gProcessedSigError"             "gNumPixOLHi"                   
[23] "gNumPixOLLo"                    "gNumPix"                       
[25] "gMeanSignal"                    "gMedianSignal"                 
[27] "gPixSDev"                       "gPixNormIQR"                   
[29] "gBGNumPix"                      "gBGMeanSignal"                 
[31] "gBGMedianSignal"                "gBGPixSDev"                    
[33] "gBGPixNormIQR"                  "gNumSatPix"                    
[35] "gIsSaturated"                   "gIsFeatNonUnifOL"              
[37] "gIsBGNonUnifOL"                 "gIsFeatPopnOL"                 
[39] "gIsBGPopnOL"                    "IsManualFlag"                  
[41] "gBGSubSignal"                   "gBGSubSigError"                
[43] "gIsPosAndSignif"                "gPValFeatEqBG"                 
[45] "gNumBGUsed"                     "gIsWellAboveBG"                
[47] "gBGUsed"                        "gBGSDUsed"                     
[49] "ErrorModel"                     "gSpatialDetrendIsInFilteredSet"
[51] "gSpatialDetrendSurfaceValue"    "SpotExtentX"                   
[53] "SpotExtentY"                    "gNetSignal"                    
[55] "gMultDetrendSignal"             "gProcessedBackground"          
[57] "gProcessedBkngError"            "IsUsedBGAdjust"                
[59] "gInterpolatedNegCtrlSub"        "gIsInNegCtrlRange"             
[61] "gIsUsedInMD"

This seems like Agilent microarray data, I want to know how to process to get expression matrix ? Any R package may be useful here? Thanks!

microarray rna • 6.9k views

ADD COMMENT • link updated 10 months ago by survive • 0 • written 4.8 years ago by MatthewP ★ 1.4k

score 12 · Accepted Answer · 2019-07-10

Edit September 5, 2019

NB - this original answer is for 1-colour (channel) Agilent data. Another generic pipeline for 2-colour Agilent is here: A: build the expression matrix step by step from GEO raw data

---------------

Limma can be used to process Agilent microarray data.

Assuming that your data is single colour / channel, which it appears to be, you should start with the raw files and a targets.txt file, which contains:

FileName        Group   Gender  Disease
raw/raw1.txt    Stage3  Male    NAFLD
raw/raw2.txt    Stage3  Female  NAFLD
raw/raw3.txt    Stage3  Female  NAFLD

I also refer to another file, Annot/Human_agilent_wholegenome_4x44k_v1_2019_06_30.tsv, which contains annotation produced by biomaRt, as follows:

# agilent_wholegenome_4x44k_v1
require(biomaRt)
mart <- useMart('ENSEMBL_MART_ENSEMBL')
mart <- useDataset('hsapiens_gene_ensembl', mart)
annotLookup <- getBM(
  mart = mart,
  attributes = c(
    'agilent_wholegenome_4x44k_v1',
    'wikigene_description',
    'ensembl_gene_id',
    'entrezgene',
    'gene_biotype',
    'external_gene_name'))
write.table(
  annotLookup,
  paste0('Human_agilent_wholegenome_4x44k_v1_', gsub("-", "_", as.character(Sys.Date())), '.tsv'),
  sep = '\t',
  row.names = FALSE,
  quote = FALSE)

Change the value of agilent_wholegenome_4x44k_v1 depending on the array that you are using.

-------------------------------

Then:

# general config
baseDir <- '.'
annotfile <- 'Annot/Human_agilent_wholegenome_4x44k_v1_2019_06_30.tsv'
targetsfile <- 'targets.txt'
setwd(baseDir)
options(scipen = 99)
require(limma)

# read in the data
# readTargets will by default look for the 'FileName' column in the specified file
targetinfo <- readTargets(targetsfile, sep = '\t')

# convert the data to an EListRaw object, which is a data object for single channel data
# specify green.only = TRUE for Agilent
# retain information about background via gIsWellAboveBG
project <- read.maimages(
  targetinfo,
  source = 'agilent.median',
  green.only = TRUE,
  other.columns = 'gIsWellAboveBG')
colnames(project) <- gsub('raw\\/', '', colnames(project))

# generate QC plots of raw intensities
dir.create('QC/')
# histograms, box, and density plots - use your own code

# annotate the probes
annotLookup <- read.csv(
  annotfile,
  header = TRUE,
  sep = '\t',
  stringsAsFactors = FALSE)
colnames(annotLookup)[1] <- 'AgilentID'
annotLookup <- annotLookup[which(annotLookup$AgilentID %in% project$genes$ProbeName),]
annotLookup <- annotLookup[match(project$genes$ProbeName, annotLookup$AgilentID),]
table(project$genes$ProbeName == annotLookup$AgilentID) # check that annots are aligned
project$genes$AgilentID <- annotLookup$AgilentID
project$genes$wikigene_description <- annotLookup$wikigene_description
project$genes$ensembl_gene_id <- annotLookup$ensembl_gene_id
project$genes$entrezgene <- annotLookup$entrezgene
project$genes$gene_biotype <- annotLookup$gene_biotype
project$genes$external_gene_name <- annotLookup$external_gene_name

# perform background correction on the fluorescent intensities
project.bgcorrect <- backgroundCorrect(project, method = 'normexp')

# normalize the data with the 'quantile' method
project.bgcorrect.norm <- normalizeBetweenArrays(project.bgcorrect, method = 'quantile')

# filter out control probes, those with no symbol, and those that fail:
Control <- project.bgcorrect.norm$genes$ControlType==1L
NoSymbol <- is.na(project.bgcorrect.norm$genes$external_gene_name)
IsExpr <- rowSums(project.bgcorrect.norm$other$gIsWellAboveBG > 0) >= 4
project.bgcorrect.norm.filt <- project.bgcorrect.norm[!Control & !NoSymbol & IsExpr, ]
dim(project.bgcorrect.norm)
dim(project.bgcorrect.norm.filt)

# remove annotation columns we no longer need
project.bgcorrect.norm.filt$genes <- project.bgcorrect.norm.filt$genes[,c(
  'ProbeName','wikigene_description','ensembl_gene_id','entrezgene','gene_biotype','external_gene_name')]
head(project.bgcorrect.norm.filt$genes)

# for replicate probes, replace values with the mean
# ID is used to identify the replicates
project.bgcorrect.norm.filt.mean <- avereps(project.bgcorrect.norm.filt,
  ID = project.bgcorrect.norm.filt$genes$ProbeName)
dim(project.bgcorrect.norm.filt.mean)

# generate QC plots of normalised intensities
# histograms, box, and density plots - use your own code

The normalised expression matrix is then contained in project.bgcorrect.norm.filt.mean$E

Most of this is written in the Limma manual: https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf

Kevin