Limma Analysis For Two-Channeled Microarray Data Fetched Using Geoquery
1
7
Entering edit mode
9.8 years ago
Gotgenes ▴ 460

EDIT 2011-06-21: In the original version of this question, I misstated that the Cy3 channel contained a measure of a common reference from pooled controls. Instead, the Cy3 channels are actually a measure of the non-exposed culture from the same animal.

This question is very similar to an unanswered question asked on the Bioconductor mailing list. (The only response I found was unhelpful.) I will re-cast it in the particular problem I'm trying to solve.

GSE13465 contains experiments on two organisms:

• human
• rat

I only care about data from the rat cultures. (The human data is present due to submitter error.)

Three biological replicates were constructed and measured.

For each biological replicate, the rat cultures took place in two media types:

• rat hepatocytes in normal media
• rat hepatocytes in modified media

Each media condition was paired with an acetaminophen (APAP) treatment condition:

• no exposure (0 mM APAP), the control
• exposure to 5 mM APAP
• exposure to 10 mM APAP

Expression levels were measured using the two-channel Agilent 011868 Rat Oligo Microarray G4130A (GPL890). All gene expression measurements were performed with the labels as such:

• Channel 1 (Cy5): 0 mM APAP (control) or 5 or 10 mM APAP (treatment)
• Channel 2 (Cy3): 0 mM APAP (control)

I would like to perform the following contrasts:

• 5 mM APAP vs. control, standard media
• 10 mM APAP vs. control, standard media
• 5 mM APAP vs. control, modified media
• 10 mM APAP vs. control, modified media

Using GEOquery, I obtained the gene expression data for the GEO Series

> library(GEOquery)
> library(limma)
> gse13465 <- getGEO("GSE13465", GSEMatrix = TRUE)
Found 2 file(s)
GSE13465-GPL887_series_matrix.txt.gz
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
100 1254k  100 1254k    0     0   302k      0  0:00:04  0:00:04 --:--:--  315k
File stored at:
/tmp/Rtmp2VCw8p/GPL887.soft
GSE13465-GPL890_series_matrix.txt.gz
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
100 1604k  100 1604k    0     0   452k      0  0:00:03  0:00:03 --:--:--  474k
File stored at:
/tmp/Rtmp2VCw8p/GPL890.soft
> ratdata <- gse13465[[2]]  # The rat data is in the second experiment
> ratdata
ExpressionSet (storageMode: lockedEnvironment)
assayData: 20501 features, 18 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM339437 GSM339438 ... GSM339454 (18 total)
varLabels: title geo_accession ... data_row_count (42 total)
featureData
featureNames: 3 4 ... 22572 (20501 total)
fvarLabels: ID COL ... SEQUENCE (20 total)
experimentData: use 'experimentData(object)'
Annotation: GPL890


I'm uncertain how to procede from here, however. The expression values that are in the ExpressionSet for the rat data are described by the submitters as the "Lowess [sic] normalized log 2 ratio (Cy5/Cy3); signal calculated using Imagene 5.0 and GeneSight 4.1 software (Biodiscovery, Palo Alto, CA)." I do not really understand how this relates to an MAList, which is the usual input for tha lmFit function. I tried the obvious, but it did not work:

> MA <- as(ratdata, "MAList")
Error in as(ratdata, "MAList") :
no method or default for coercing "ExpressionSet" to "MAList"


I'm also not completely certain for how to set up the fit and contrast matrices; in this case there are not measurements from each channel but rather a single value per probeset representing the log-2 ratio.

I would really appreciate any help in performing this contrast, both from the standpoint of helping my research but also just wrapping my head around R/Bioconductor, and limma and GEOquery in particular.

bioconductor limma geo • 6.9k views
0
Entering edit mode

I'm working on an answer, but my initial thought is that you might need to grab the supplementary data files. These contain separate files with intensities for Cy5 and Cy3, which I think you'll need for limma.

0
Entering edit mode

I thought this too, Neil, but I got even more confused because I do not recognize the file formats. From reading limma User's Guide, I thought both the Cy5 and Cy3 intensities should be in the same file, then read in with read.maimages(targets, source="agilent"), but I take it these are not the Agilent files.

0
Entering edit mode

It's very likely that the format is just "some kind of delimited text", rather than anything specific. There are some limma examples floating around the Web describing how to get that type of file into the appropriate Bioconductor object.

0
Entering edit mode

I'm actually guessing they're ImaGene files based on the limma User's Guide stating, "ImaGene stores red and green intensities in separate files," (Section 4.3, page 16 of the PDF as of limma v.3.8.2), there being separate Cy3 and Cy5 files in the supplementary material of the GSMs, and the VALUE description in the GSMs stating, "signal calculated using Imagene 5.0 and GeneSight 4.1 software."

5
Entering edit mode
9.8 years ago

First point: I'm not a limma pro, but lmFit() in lima 3.8.1 (what I currently have installed) is advertised to accept an "object of class numeric, matrix, MAList, EList, marrayNorm, ExpressionSet or PLMset containing log-ratios or log-values of expression for a series of microarrays", so a valid ExpressionSet should be kosher. The Detail notes that "The expression data should be log-ratios for two-color array platforms"; this, I imagine, accounts for the perfunctory nature of the response on the distribution list.

That said:

I pull down and look at a lot of GEO datasets, and I never use processed results when the raw data are available. You can't detect batch effects that stem from scan date without the raw data. That will give you terrible, nasty problems if you don't detect it. You get complete control over the analysis. Unless you have no other choice, I advise you to go to the trouble of re-normalizing the data.

3
Entering edit mode

Just to expand on David's answer: you call exprs(ratdata) to get the matrix of log2 values, which you can then pass to lmFit() along with a design matrix.

2
Entering edit mode

I'm pretty sure this is ImaGene format, as there's an ImaGene 6.1.0 (see the version listed in the file header. Figured that out from google searching "6.1.0" agilent microarray). From the docs for read.maImages, "For data from ImaGene versions 1 to 8 (source="imagene"), the argument files should be a matrix with two columns. The first column should contain the names of the files containing green channel (cy3) data and the second column should contain names of files containing red channel (cy5) data." That would fit with two separate files. Forensic Bioinformatics!

2
Entering edit mode

I'm accepting this answer, summarized simply as, "Don't use GEOquery to obtain gene expression measurements; use raw data." GEOquery is nice for grabbing meta-information and information about the platform probes, but the expression data deposited in GEO is a complete grab-bag and requires close scrutiny for each sample and series.

0
Entering edit mode

We always start from raw data where available too. You just don't know what the submitter did, even when they say what they did :-)

0
Entering edit mode

David, it's helpful to know that using the raw data is preferable. The raw expression files are available separately for the Cy5 and Cy3 data for each of the samples (e.g. see http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM339445 ) whereas the limma User's Guide describes scenarios when the Cy3 and Cy5 data are in the same file, so I'm unsure how to load them up with read.maimages().

0
Entering edit mode

@David I arrived at the same conclusion; I kept checking back at Biostar but for whatever reason didn't get a notification that you added your comment. Thanks for your selfless sleuthing.