How to process (seems) Agilent microarrry data?
1
6
Entering edit mode
4.8 years ago
MatthewP ★ 1.4k

Hello, this is my first time to process microarray data, I don't even know what my data means. The data has too many columns to post here, I just show all colnames:

> colnames(df)
 [1] "FEATURES"                       "FeatureNum"                    
 [3] "Row"                            "Col"                           
 [5] "chr_coord"                      "SubTypeMask"                   
 [7] "SubTypeName"                    "Start"                         
 [9] "Sequence"                       "ProbeUID"                      
[11] "ControlType"                    "ProbeName"                     
[13] "GeneName"                       "SystematicName"                
[15] "Description"                    "PositionX"                     
[17] "PositionY"                      "gSurrogateUsed"                
[19] "gIsFound"                       "gProcessedSignal"              
[21] "gProcessedSigError"             "gNumPixOLHi"                   
[23] "gNumPixOLLo"                    "gNumPix"                       
[25] "gMeanSignal"                    "gMedianSignal"                 
[27] "gPixSDev"                       "gPixNormIQR"                   
[29] "gBGNumPix"                      "gBGMeanSignal"                 
[31] "gBGMedianSignal"                "gBGPixSDev"                    
[33] "gBGPixNormIQR"                  "gNumSatPix"                    
[35] "gIsSaturated"                   "gIsFeatNonUnifOL"              
[37] "gIsBGNonUnifOL"                 "gIsFeatPopnOL"                 
[39] "gIsBGPopnOL"                    "IsManualFlag"                  
[41] "gBGSubSignal"                   "gBGSubSigError"                
[43] "gIsPosAndSignif"                "gPValFeatEqBG"                 
[45] "gNumBGUsed"                     "gIsWellAboveBG"                
[47] "gBGUsed"                        "gBGSDUsed"                     
[49] "ErrorModel"                     "gSpatialDetrendIsInFilteredSet"
[51] "gSpatialDetrendSurfaceValue"    "SpotExtentX"                   
[53] "SpotExtentY"                    "gNetSignal"                    
[55] "gMultDetrendSignal"             "gProcessedBackground"          
[57] "gProcessedBkngError"            "IsUsedBGAdjust"                
[59] "gInterpolatedNegCtrlSub"        "gIsInNegCtrlRange"             
[61] "gIsUsedInMD"

This seems like Agilent microarray data, I want to know how to process to get expression matrix ? Any R package may be useful here? Thanks!

microarray rna • 6.9k views
ADD COMMENT
12
Entering edit mode
4.8 years ago

Edit September 5, 2019

NB - this original answer is for 1-colour (channel) Agilent data. Another generic pipeline for 2-colour Agilent is here: A: build the expression matrix step by step from GEO raw data

---------------

Limma can be used to process Agilent microarray data.

Assuming that your data is single colour / channel, which it appears to be, you should start with the raw files and a targets.txt file, which contains:

FileName        Group   Gender  Disease
raw/raw1.txt    Stage3  Male    NAFLD
raw/raw2.txt    Stage3  Female  NAFLD
raw/raw3.txt    Stage3  Female  NAFLD

I also refer to another file, Annot/Human_agilent_wholegenome_4x44k_v1_2019_06_30.tsv, which contains annotation produced by biomaRt, as follows:

# agilent_wholegenome_4x44k_v1
require(biomaRt)
mart <- useMart('ENSEMBL_MART_ENSEMBL')
mart <- useDataset('hsapiens_gene_ensembl', mart)
annotLookup <- getBM(
  mart = mart,
  attributes = c(
    'agilent_wholegenome_4x44k_v1',
    'wikigene_description',
    'ensembl_gene_id',
    'entrezgene',
    'gene_biotype',
    'external_gene_name'))
write.table(
  annotLookup,
  paste0('Human_agilent_wholegenome_4x44k_v1_', gsub("-", "_", as.character(Sys.Date())), '.tsv'),
  sep = '\t',
  row.names = FALSE,
  quote = FALSE)

Change the value of agilent_wholegenome_4x44k_v1 depending on the array that you are using.

-------------------------------

Then:

# general config
baseDir <- '.'
annotfile <- 'Annot/Human_agilent_wholegenome_4x44k_v1_2019_06_30.tsv'
targetsfile <- 'targets.txt'
setwd(baseDir)
options(scipen = 99)
require(limma)

# read in the data
# readTargets will by default look for the 'FileName' column in the specified file
targetinfo <- readTargets(targetsfile, sep = '\t')

# convert the data to an EListRaw object, which is a data object for single channel data
# specify green.only = TRUE for Agilent
# retain information about background via gIsWellAboveBG
project <- read.maimages(
  targetinfo,
  source = 'agilent.median',
  green.only = TRUE,
  other.columns = 'gIsWellAboveBG')
colnames(project) <- gsub('raw\\/', '', colnames(project))

# generate QC plots of raw intensities
dir.create('QC/')
# histograms, box, and density plots - use your own code

# annotate the probes
annotLookup <- read.csv(
  annotfile,
  header = TRUE,
  sep = '\t',
  stringsAsFactors = FALSE)
colnames(annotLookup)[1] <- 'AgilentID'
annotLookup <- annotLookup[which(annotLookup$AgilentID %in% project$genes$ProbeName),]
annotLookup <- annotLookup[match(project$genes$ProbeName, annotLookup$AgilentID),]
table(project$genes$ProbeName == annotLookup$AgilentID) # check that annots are aligned
project$genes$AgilentID <- annotLookup$AgilentID
project$genes$wikigene_description <- annotLookup$wikigene_description
project$genes$ensembl_gene_id <- annotLookup$ensembl_gene_id
project$genes$entrezgene <- annotLookup$entrezgene
project$genes$gene_biotype <- annotLookup$gene_biotype
project$genes$external_gene_name <- annotLookup$external_gene_name

# perform background correction on the fluorescent intensities
project.bgcorrect <- backgroundCorrect(project, method = 'normexp')

# normalize the data with the 'quantile' method
project.bgcorrect.norm <- normalizeBetweenArrays(project.bgcorrect, method = 'quantile')

# filter out control probes, those with no symbol, and those that fail:
Control <- project.bgcorrect.norm$genes$ControlType==1L
NoSymbol <- is.na(project.bgcorrect.norm$genes$external_gene_name)
IsExpr <- rowSums(project.bgcorrect.norm$other$gIsWellAboveBG > 0) >= 4
project.bgcorrect.norm.filt <- project.bgcorrect.norm[!Control & !NoSymbol & IsExpr, ]
dim(project.bgcorrect.norm)
dim(project.bgcorrect.norm.filt)

# remove annotation columns we no longer need
project.bgcorrect.norm.filt$genes <- project.bgcorrect.norm.filt$genes[,c(
  'ProbeName','wikigene_description','ensembl_gene_id','entrezgene','gene_biotype','external_gene_name')]
head(project.bgcorrect.norm.filt$genes)

# for replicate probes, replace values with the mean
# ID is used to identify the replicates
project.bgcorrect.norm.filt.mean <- avereps(project.bgcorrect.norm.filt,
  ID = project.bgcorrect.norm.filt$genes$ProbeName)
dim(project.bgcorrect.norm.filt.mean)

# generate QC plots of normalised intensities
# histograms, box, and density plots - use your own code

The normalised expression matrix is then contained in project.bgcorrect.norm.filt.mean$E

Most of this is written in the Limma manual: https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf

Kevin

ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Kevin Blighe Hi, wondering if this is within-array normalization or between-array normalization? if it is within-array normalization, how can we do between-array normalization? Thank you.

ADD REPLY

Login before adding your answer.

Traffic: 2717 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6