Question: build the expression matrix step by step from GEO raw data
0
gravatar for a511512345
11 weeks ago by
a51151234560
China guangxi nanning
a51151234560 wrote:

I downloaded the raw data for the Agilent-020382 Human Custom Microarray 44k (Feature Number version) platform from GEO, but I do not know how to build the expression matrix step by step. Can you help me? Thank you!

geo • 323 views
ADD COMMENTlink modified 11 weeks ago by Kevin Blighe11k • written 11 weeks ago by a51151234560
7
gravatar for Kevin Blighe
11 weeks ago by
Kevin Blighe11k
London/Brazil
Kevin Blighe11k wrote:

Hello and good afternoon / evening / morning,

I presume that you have downloaded the Agilent raw TXT files?

For 2-colour Agilent microarrays, the following generic pipeline should allow you to produce a normalised expression matrix and then conduct a simple diff. expression analysis (case-control):

Set-up

source("http://bioconductor.org/biocLite.R")

#Set 999 decimal places
options(scipen=999)

require("limma")

targetinfo <- readTargets("Targets.txt", sep="\t")

Targets.txt contains data in the format:

FileName WT_KO

SampleFiles/Array1.txt WT

SampleFiles/Array2.txt KO

SampleFiles/Array3.txt KO

SampleFiles/Array4.txt WT

  • SampleFiles is a directory in your current working directory
  • the 'FileName' column should remain as that (cases sensitive)
  • WT_KO just describes one condition of interest (can be any name)
  • You can have any number of conditions (as extra columns) in this file

Read in and normalise data

#Converts the data to a RGList (two-colour [red-green] array), with values for R, Rg, G, Gb
project <- read.maimages(targetinfo, source="agilent")

#Perform background correction on the fluorescent intensities
#'normexp' is beneficial in that it doesn't result in negative values, meaning no data is lost
project.bgcorrect <- backgroundCorrect(project, method="normexp", offset=16)

#Normalize the data with the 'loess' method
#LOESS performs local regression on subsets of the data, resulting in the generation of a 'regression curve' through it
project.bgcorrect.norm <- normalizeWithinArrays(project.bgcorrect, method="loess")

#For replicate probes in each sample, replace values with the average
project.bgcorrect.norm.avg <- avereps(project.bgcorrect.norm, ID=project.bgcorrect.norm$genes$ProbeName)

#Write out the expression values (using filename as headers)
temp <- project.bgcorrect.norm.avg$M
rownames(temp) <- project.bgcorrect.norm.avg$genes$GeneName
temp <- temp[order(rownames(temp), decreasing=FALSE),]
write.table(temp, "Results/MasterExpression.tsv", sep="\t", quote=FALSE, append=FALSE)

#Summarise over gene names (averaging counts over transcripts)
temp <- data.frame(cbind(data.frame(project.bgcorrect.norm.avg$genes$GeneName), data.frame(project.bgcorrect.norm.avg$M)))
project.bgcorrect.norm.avg.summarised <- aggregate(temp[,2:ncol(temp)], temp[1], mean)
rownames(project.bgcorrect.norm.avg.summarised) <- project.bgcorrect.norm.avg.summarised[,1]
project.bgcorrect.norm.avg.summarised <- project.bgcorrect.norm.avg.summarised[,2:ncol(project.bgcorrect.norm.avg.summarised)]
write.table(project.bgcorrect.norm.avg.summarised, "Results/MasterExpressionAverages.tsv", sep="\t", quote=FALSE, append=FALSE)

From this, you will have 2 files in your Results directory:

  • MasterExpression.tsv - normalised counts per target probe
  • MasterExpressionAverages.tsv - normalised counts summarised (by mean) over genes

QC

#Generate chip images to diagnose spatial artefacts for each array and a merged boxplot of intensities
image(project)

boxplot(project.bgcorrect.norm.avg.summarised, main="Background-corrected, RMA normalised, log2 expression values\nAll probes", col="royalblue", las=2)

#Principal components analysis
project.pca <- prcomp(t(project.bgcorrect.norm.avg.summarised), scale=TRUE)

#Determine the proportion of variance of each component
project.pca.proportionvariances <- ((project.pca$sdev^2) / (sum(project.pca$sdev^2)))*100

pairs(project.pca$x[,1:5], col="forestgreen", cex=0.8, main="Principal components analysis bi-plot\nPCs 1-5", pch=16)

Differential expression

#Create the study design and comparison model
design <- paste(targetinfo$WT_KO, sep="")
design <- factor(design)

comparisonmodel <- model.matrix(~0+design)
colnames(comparisonmodel) <- levels(design)

#Fit the linear model on the study's data
project.fitmodel <- lmFit(project.bgcorrect.norm.avg.summarised, comparisonmodel)

#Applying the empirical Bayes method to the fitted values acts as an extra normalization step and aims to bring the different probe-wise variances to common values.
project.fitmodel.eBayes <- eBayes(project.fitmodel)

names(project.fitmodel.eBayes)

#Make individual contrasts and fit the new model
CaseControl <- makeContrasts(CaseControl="KO-WT", levels=comparisonmodel)
CaseControl.fitmodel <- contrasts.fit(project.fitmodel.eBayes, CaseControl)
CaseControl.fitmodel.eBayes <- eBayes(CaseControl.fitmodel)

topTable(CaseControl.fitmodel.eBayes, adjust="fdr", coef="CaseControl", number=99999, p.value=1)
ADD COMMENTlink modified 7 weeks ago • written 11 weeks ago by Kevin Blighe11k
1

up voted and do you have a blog/github or any other repos for these work flows? I see that you have github profile and could you please make these (work flows) available there too?

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by cpad01124.1k
1

Hey cpad, thanks for the comment. This protocol above is specific for 2-colour Agilent arrays. The process is different for Affymetrix and also single-colour Agilent arrays.

I do have GitHub but do not put everything there (no time). I don't use Twitter or have any other blog.

I have been working or studying 365 days per year since 2005 (more or less), even on Christmas Day, and am only now on Biostars because I have an extended break. I have 1000s of scripts in my collection for all types of data.

I have been meaning to create a 'forum' post on Biostars about forming a community aimed at setting standards in bioinformatics, and creating standardised pipelines for processing data (like the pipeline/process above for Agilent 2-colour arrays). So, watch out for that as I will post it later today/tonight.

Thanks!

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Kevin Blighe11k
1

Great going! I saw you posting several work flows (in addition to present one here) across several posts (for other kinds of data too, in addition to agilent here) and all of them are useful to most visitors here including me. I was looking forward for one stop source for such well documented and tested workflows, posted by you, here. It seems such a thing is in offing and I would be very happy to look forward for it.

ADD REPLYlink written 11 weeks ago by cpad01124.1k

Hey Kevin,

Wouldn't you have a code for single-colour?

Agilent-014850 Whole Human Genome Microarray 4x44K G4112F

Thanks

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by Leite200
1

Quite possibly somewhere on my computer. Let me check...

ADD REPLYlink written 7 weeks ago by Kevin Blighe11k
1

Hi Leite, I'm not sure that I performed work on the single-colour arrays previously. Have you tried the code above, nevertheless? I believe that the line read.maimages(targetinfo, source="agilent") will automatically identify whether it's a single- or two-colour array.

ADD REPLYlink written 7 weeks ago by Kevin Blighe11k

I'll try perfome it with this code and sent you a reply.

Thanks

ADD REPLYlink written 7 weeks ago by Leite200
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1005 users visited in the last hour