Question: build the expression matrix step by step from GEO raw data
1
gravatar for a511512345
2.6 years ago by
a511512345110
China guangxi nanning
a511512345110 wrote:

I downloaded the raw data for the Agilent-020382 Human Custom Microarray 44k (Feature Number version) platform from GEO, but I do not know how to build the expression matrix step by step. Can you help me? Thank you!

geo • 3.2k views
ADD COMMENTlink modified 2.6 years ago by Kevin Blighe60k • written 2.6 years ago by a511512345110
10
gravatar for Kevin Blighe
2.6 years ago by
Kevin Blighe60k
Kevin Blighe60k wrote:

Edit September 5, 2019

NB - this original answer is for 2-colour (channel) Agilent data. Another generic pipeline for 1-colour Agilent is here: A: How to process (seems) Agilent microarrry data?

---------------------

I presume that you have downloaded the Agilent raw TXT files?

For 2-colour (channel) Agilent Microarrays, the following generic pipeline should allow you to produce a normalised expression matrix and perform a simple differential expression analysis (case-control):

Set-up

# Set 9 decimal places
options(scipen =9 )

require('limma')

targetinfo <- readTargets('Targets.txt', sep = '\t')

Targets.txt contains data in the format:

FileName                WT_KO
SampleFiles/Array1.txt  WT
SampleFiles/Array2.txt  KO
SampleFiles/Array3.txt  KO
SampleFiles/Array4.txt  WT
  • SampleFiles is a directory in your current working directory
  • the 'FileName' column should remain as that (cases sensitive)
  • WT_KO just describes one condition of interest (can be any name)
  • You can have any number of conditions (as extra columns) in this file

Read in and normalise data

# Converts the data to a RGList (two-colour [red-green] array), with values for R, Rg, G, Gb
project <- read.maimages(targetinfo, source = 'agilent')

# Perform background correction on the fluorescent intensities
project.bgcorrect <- backgroundCorrect(project, method = 'normexp', offset = 16)

# Normalize the data with the 'loess' method
project.bgcorrect.norm <- normalizeWithinArrays(project.bgcorrect, method = 'loess')

# For replicate probes in each sample, replace values with the average
project.bgcorrect.norm.avg <- avereps(
  project.bgcorrect.norm,
  ID = project.bgcorrect.norm$genes$ProbeName)

QC

# Generate chip images to diagnose spatial artefacts
image(project)

# box-and-whiskers
boxplot(
  project.bgcorrect.norm.avg,
  col = "royalblue",
  las = 2)

# PCA
p <- prcomp(t(project.bgcorrect.norm.avg), scale = TRUE)

# Determine the proportion of variance of each component
proportionvariances <- ((p$sdev^2) / (sum(p$sdev^2)))*100

pairs(
  p$x[,1:5],
  col = "forestgreen",
  cex = 0.8,
  main = "Principal components analysis bi-plot\nPCs 1-5",
  pch = 16)

Differential expression

# Create the study design
design <- model.matrix(~ 0 + factor(targetinfo$WT_KO, levels = c('WT', 'KO')))
colnames(design) <- c('WT', 'KO')

# Fit the linear model on the study's data
project.fitmodel <- lmFit(
  project.bgcorrect.norm.avg,
  design)

# Applying the empirical Bayes method to the fitted values
# Acts as an extra normalisation step and aims to bring the different probe-wise variances to common values
project.fitmodel.eBayes <- eBayes(project.fitmodel)
names(project.fitmodel.eBayes)

# Make individual contrasts
CaseControl <- makeContrasts(CaseControl = 'KO-WT', levels = design)
CaseControl.fitmodel <- contrasts.fit(project.fitmodel.eBayes, CaseControl)
CaseControl.fitmodel.eBayes <- eBayes(CaseControl.fitmodel)

topTable(
  CaseControl.fitmodel.eBayes,
  adjust = 'BH',
  coef = "CaseControl",
  number = 99999,
  p.value = 1)
ADD COMMENTlink modified 8 days ago • written 2.6 years ago by Kevin Blighe60k
1

up voted and do you have a blog/github or any other repos for these work flows? I see that you have github profile and could you please make these (work flows) available there too?

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by cpad011213k
2

Hey cpad, thanks for the comment. This protocol above is specific for 2-colour Agilent arrays. The process is different for Affymetrix and also single-colour Agilent arrays.

I do have GitHub but do not put everything there (no time). I don't use Twitter or have any other blog.

I have been working or studying 365 days per year since 2005 (more or less), even on Christmas Day, and am only now on Biostars because I have an extended break. I have 1000s of scripts in my collection for all types of data.

I have been meaning to create a 'forum' post on Biostars about forming a community aimed at setting standards in bioinformatics, and creating standardised pipelines for processing data (like the pipeline/process above for Agilent 2-colour arrays). So, watch out for that as I will post it later today/tonight.

Thanks!

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Kevin Blighe60k
1

Great going! I saw you posting several work flows (in addition to present one here) across several posts (for other kinds of data too, in addition to agilent here) and all of them are useful to most visitors here including me. I was looking forward for one stop source for such well documented and tested workflows, posted by you, here. It seems such a thing is in offing and I would be very happy to look forward for it.

ADD REPLYlink written 2.6 years ago by cpad011213k

Hey Kevin,

Wouldn't you have a code for single-colour?

Agilent-014850 Whole Human Genome Microarray 4x44K G4112F

Thanks

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Leite980
2

Quite possibly somewhere on my computer. Let me check...

ADD REPLYlink written 2.5 years ago by Kevin Blighe60k
2

Hi Leite, I'm not sure that I performed work on the single-colour arrays previously. Have you tried the code above, nevertheless? I believe that the line read.maimages(targetinfo, source="agilent") will automatically identify whether it's a single- or two-colour array.

ADD REPLYlink written 2.5 years ago by Kevin Blighe60k

I'll try perfome it with this code and sent you a reply.

Thanks

ADD REPLYlink written 2.5 years ago by Leite980
1

Hi, thank you for this post, it's very helpful! I have a question about raw file format requirement.

I am also trying to analyse Agilent microarray found on GEO with platform referenced as "Homo sapiens 1X44K Custom Array". It was indicated that "Raw data were included within Sample table". I downloaded them from this link : https://ftp.ncbi.nlm.nih.gov/geo/series/GSE22nnn/GSE22358/miniml/GSE22358_family.xml.tgz Then, I unzipped them and I tried to processed them with the read.maimages function from Limma package

There was no Targets.txt file or something close to it, so I did:

target<-list.files("GSE22358_family.xml",pattern = "^GSM")

The .txt files did not have column names, but I found them here:

"ID_REF", "VALUE", "SPOT", "CH1_MEAN", "CH1_SD", "CH1_BKD_MEDIAN", "CH1_BKD_SD", "CH2_MEAN", "CH2_SD", "CH2_BKD_MEDIAN", "CH2_BKD_SD", "TOT_BPIX", "TOT_SPIX", "CH2BN_MEDIAN", "CH2IN_MEAN", "CH1DL_MEAN","CH2DL_MEAN", "LOG_RAT2N_MEAN", "CORR", "FLAG"

I replaced CH1_MEAN and CH2_MEAN by G and R (I chose color randomly because I did not find information about it), and CH1_BKD_MEDIAN and CH2_BKD_MEDIAN by Gb and Rb. I added them manually as header to the .txt files, then I ran this code:

col_names <- c("ID_REF", "VALUE", "SPOT", "G", "CH1_SD", "Gb", "CH1_BKD_SD", "R",  "CH2_SD", "Rb", "CH2_BKD_SD", "TOT_BPIX", "TOT_SPIX", 
               "CH2BN_MEDIAN", "CH2IN_MEAN", "CH1DL_MEAN", "CH2DL_MEAN", "LOG_RAT2N_MEAN", "CORR", "FLAG")

col_names <-setNames(as.character(col_names),  col_names)
target <- read.maimages(files=target, columns = col_names, source="agilent")

but I had this error message:

 Warning message:
 In read.maimages(files = target, columns = col_names,  :
   non-standard columns specified

I kept going with your code

 project.bgcorrect <- backgroundCorrect(dat, method="normexp", offset=16)
 project.bgcorrect.norm <- normalizeWithinArrays(project.bgcorrect, method="loess")
 temp <- project.bgcorrect.norm$M

and it seems to work, but I just wanted to be sure that I was doing things properly:

How to know which colors are CH1 and CH2? Are CH1/2_MEAN and CH1/2_BKD_MEDIAN really correspond to G/R and Gb/Rb ? If I am doing right, is there an automatised way to add the same column header to all the .txt files with R? (and to do the same with row names to replace the custom IDs they put by gene symbol IDs?)

ADD REPLYlink modified 23 months ago • written 23 months ago by josephine.tidiane10
1

Hey Josephine, please note that you can obtain the normalised data by just following the R script here: https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE22358 (click on the R script tab).

...or did you want to actually start from the raw data stage?

ADD REPLYlink modified 23 months ago • written 23 months ago by Kevin Blighe60k

I need to start from raw data because I want to do a multi-series analysis and I need to use the same preprocessing algorithm for all series

ADD REPLYlink written 23 months ago by josephine.tidiane10
1

Okay, sorry, my time s quite limited to look at this in greater detail right now (I will have time later). Did you try any of the other sources that are possible with the read.maimages() function?

You can see the available options Here - there are a few for Agilent arrays.

The warning message that you received is somewhat worrying, as wrong values may have been used for normalisation. Did you generate a boxplot to see if it looks normalised, nevertheless?

ADD REPLYlink written 23 months ago by Kevin Blighe60k

Hi Kevin, I looked into maimages function code and I found that: scanarrayexpress = list(G = "Ch1 Mean", Gb = "Ch1 B Median", R = "Ch2 Mean", Rb = "Ch2 B Median"),

I also found that: if (!all(cnames %in% c("G", "R", "Gb", "Rb", "E", "Eb"))) warning("non-standard columns specified")

So I guess the warning messsage is not a big deal as long as I have my "G", "R", "Gb", "Rb" labelled correctly, which seems to be the case according to the scanarrayexpress list I found in the code.

I also looked at boxplot and it seems well normalised. Thank you for your time and for your help.

ADD REPLYlink written 23 months ago by josephine.tidiane10

Cool, yes, looks like you have done the mapping correctly and that the warning can be ignored.

ADD REPLYlink written 23 months ago by Kevin Blighe60k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 749 users visited in the last hour