Question: How to process (seems) Agilent microarrry data?
1
gravatar for MatthewP
20 months ago by
MatthewP880
China
MatthewP880 wrote:

Hello, this is my first time to process microarray data, I don't even know what my data means. The data has too many columns to post here, I just show all colnames:

> colnames(df)
 [1] "FEATURES"                       "FeatureNum"                    
 [3] "Row"                            "Col"                           
 [5] "chr_coord"                      "SubTypeMask"                   
 [7] "SubTypeName"                    "Start"                         
 [9] "Sequence"                       "ProbeUID"                      
[11] "ControlType"                    "ProbeName"                     
[13] "GeneName"                       "SystematicName"                
[15] "Description"                    "PositionX"                     
[17] "PositionY"                      "gSurrogateUsed"                
[19] "gIsFound"                       "gProcessedSignal"              
[21] "gProcessedSigError"             "gNumPixOLHi"                   
[23] "gNumPixOLLo"                    "gNumPix"                       
[25] "gMeanSignal"                    "gMedianSignal"                 
[27] "gPixSDev"                       "gPixNormIQR"                   
[29] "gBGNumPix"                      "gBGMeanSignal"                 
[31] "gBGMedianSignal"                "gBGPixSDev"                    
[33] "gBGPixNormIQR"                  "gNumSatPix"                    
[35] "gIsSaturated"                   "gIsFeatNonUnifOL"              
[37] "gIsBGNonUnifOL"                 "gIsFeatPopnOL"                 
[39] "gIsBGPopnOL"                    "IsManualFlag"                  
[41] "gBGSubSignal"                   "gBGSubSigError"                
[43] "gIsPosAndSignif"                "gPValFeatEqBG"                 
[45] "gNumBGUsed"                     "gIsWellAboveBG"                
[47] "gBGUsed"                        "gBGSDUsed"                     
[49] "ErrorModel"                     "gSpatialDetrendIsInFilteredSet"
[51] "gSpatialDetrendSurfaceValue"    "SpotExtentX"                   
[53] "SpotExtentY"                    "gNetSignal"                    
[55] "gMultDetrendSignal"             "gProcessedBackground"          
[57] "gProcessedBkngError"            "IsUsedBGAdjust"                
[59] "gInterpolatedNegCtrlSub"        "gIsInNegCtrlRange"             
[61] "gIsUsedInMD"

This seems like Agilent microarray data, I want to know how to process to get expression matrix ? Any R package may be useful here? Thanks!

microarray rna • 1.9k views
ADD COMMENTlink modified 20 months ago by Kevin Blighe71k • written 20 months ago by MatthewP880
5
gravatar for Kevin Blighe
20 months ago by
Kevin Blighe71k
Republic of Ireland
Kevin Blighe71k wrote:

Edit September 5, 2019

NB - this original answer is for 1-colour (channel) Agilent data. Another generic pipeline for 2-colour Agilent is here: A: build the expression matrix step by step from GEO raw data

---------------

Limma can be used to process Agilent microarray data.

Assuming that your data is single colour / channel, which it appears to be, you should start with the raw files and a targets.txt file, which contains:

FileName        Group   Gender  Disease
raw/raw1.txt    Stage3  Male    NAFLD
raw/raw2.txt    Stage3  Female  NAFLD
raw/raw3.txt    Stage3  Female  NAFLD

I also refer to another file, Annot/Human_agilent_wholegenome_4x44k_v1_2019_06_30.tsv, which contains annotation produced by biomaRt, as follows:

# agilent_wholegenome_4x44k_v1
require(biomaRt)
mart <- useMart('ENSEMBL_MART_ENSEMBL')
mart <- useDataset('hsapiens_gene_ensembl', mart)
annotLookup <- getBM(
  mart = mart,
  attributes = c(
    'agilent_wholegenome_4x44k_v1',
    'wikigene_description',
    'ensembl_gene_id',
    'entrezgene',
    'gene_biotype',
    'external_gene_name'))
write.table(
  annotLookup,
  paste0('Human_agilent_wholegenome_4x44k_v1_', gsub("-", "_", as.character(Sys.Date())), '.tsv'),
  sep = '\t',
  row.names = FALSE,
  quote = FALSE)

Change the value of agilent_wholegenome_4x44k_v1 depending on the array that you are using.

-------------------------------

Then:

# general config
baseDir <- '.'
annotfile <- 'Annot/Human_agilent_wholegenome_4x44k_v1_2019_06_30.tsv'
targetsfile <- 'targets.txt'
setwd(baseDir)
options(scipen = 99)
require(limma)

# read in the data
# readTargets will by default look for the 'FileName' column in the specified file
targetinfo <- readTargets(targetsfile, sep = '\t')

# convert the data to an EListRaw object, which is a data object for single channel data
# specify green.only = TRUE for Agilent
# retain information about background via gIsWellAboveBG
project <- read.maimages(
  targetinfo,
  source = 'agilent.median',
  green.only = TRUE,
  other.columns = 'gIsWellAboveBG')
colnames(project) <- gsub('raw\\/', '', colnames(project))

# generate QC plots of raw intensities
dir.create('QC/')
# histograms, box, and density plots - use your own code

# annotate the probes
annotLookup <- read.csv(
  annotfile,
  header = TRUE,
  sep = '\t',
  stringsAsFactors = FALSE)
colnames(annotLookup)[1] <- 'AgilentID'
annotLookup <- annotLookup[which(annotLookup$AgilentID %in% project$genes$ProbeName),]
annotLookup <- annotLookup[match(project$genes$ProbeName, annotLookup$AgilentID),]
table(project$genes$ProbeName == annotLookup$AgilentID) # check that annots are aligned
project$genes$AgilentID <- annotLookup$AgilentID
project$genes$wikigene_description <- annotLookup$wikigene_description
project$genes$ensembl_gene_id <- annotLookup$ensembl_gene_id
project$genes$entrezgene <- annotLookup$entrezgene
project$genes$gene_biotype <- annotLookup$gene_biotype
project$genes$external_gene_name <- annotLookup$external_gene_name

# perform background correction on the fluorescent intensities
project.bgcorrect <- backgroundCorrect(project, method = 'normexp')

# normalize the data with the 'quantile' method
project.bgcorrect.norm <- normalizeBetweenArrays(project.bgcorrect, method = 'quantile')

# filter out control probes, those with no symbol, and those that fail:
Control <- project.bgcorrect.norm$genes$ControlType==1L
NoSymbol <- is.na(project.bgcorrect.norm$genes$external_gene_name)
IsExpr <- rowSums(project.bgcorrect.norm$other$gIsWellAboveBG > 0) >= 4
project.bgcorrect.norm.filt <- project.bgcorrect.norm[!Control & !NoSymbol & IsExpr, ]
dim(project.bgcorrect.norm)
dim(project.bgcorrect.norm.filt)

# remove annotation columns we no longer need
project.bgcorrect.norm.filt$genes <- project.bgcorrect.norm.filt$genes[,c(
  'ProbeName','wikigene_description','ensembl_gene_id','entrezgene','gene_biotype','external_gene_name')]
head(project.bgcorrect.norm.filt$genes)

# for replicate probes, replace values with the mean
# ID is used to identify the replicates
project.bgcorrect.norm.filt.mean <- avereps(project.bgcorrect.norm.filt,
  ID = project.bgcorrect.norm.filt$genes$ProbeName)
dim(project.bgcorrect.norm.filt.mean)

# generate QC plots of normalised intensities
# histograms, box, and density plots - use your own code

The normalised expression matrix is then contained in project.bgcorrect.norm.filt.mean$E

Most of this is written in the Limma manual: https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf

Kevin

ADD COMMENTlink modified 18 months ago • written 20 months ago by Kevin Blighe71k

Also see https://f1000research.com/articles/5-1384/v2 for inspiration.

ADD REPLYlink written 20 months ago by ATpoint46k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2086 users visited in the last hour
_