I'm trying to get the data for the following publication:
The specific GEO accession I'm trying to retrieve is GSE7606 (it's a sub-series of GSE7615, the full dataset for the paper mentioned). I've used GEOquery to download the supplementary files for the series and extracted them:
    require(GEOquery)
    gseid <- "GSE7606"
    supp.melanoma <- getGEOSuppFiles(gseid)
    ## manually un-tar/gunzip them
Since they are CGH profiles, I'm reading them as such:
    require(limma)
    datapath <- "/path/to/data/GSE7606/"
    filenames <- list.files(datapath, pattern="GSM.*.txt")
    cgh.data <- read.maimages(files=filenames, path=datapath,
                              columns=list(G="gMedianSignal", Gb="gBGMedianSignal",
                                           R="rMedianSignal", Rb="rBGMedianSignal"),
                              annotation=c("Row", "Col", "FeatureNum", "ControlType",
                                           "ProbeName", "ProbeUID", "SystematicName",
                                           "GeneName"),
                              source='agilent')
I want to segment them for CGH analysis, but for whatever reason the files don't include chromosomal locations. So I'll get them from the platform record (GPL887, according to the GSE7606 entry). Note also that the supplementary data includes a txt file that is supposedly an old version of the GPL annotations for these arrays; as shown next, it does not load:
    # try to get directly from GEO; this works!
    gpl887 <- getGEO("GPL887", destdir="./data/GSE7606/")
    # try to read from their file; doesn't work!
    gpl887.included <- getGEO(filename=paste(datapath, "GPL887_old_annotations.txt", sep="/"))
But their file does not load correctly:
    > gpl887.included
    An object of class "GPL"
    An object of class "GEODataTable"
    ****** Column Descriptions ******
    data frame with 0 columns and 0 rows
    ****** Data Table ******
    data frame with 0 columns and 0 rows
Furthermore, I can't match up IDs from nearly half of the probes in the CGH data with annotations from the GPL data:
    > ingpl <- cgh.data$gene$ProbeName %in% Table(gpl887)$SPOT_ID
    > summary(ingpl)
       Mode   FALSE    TRUE    NA's
    logical   10295   11858       0
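One way to narrow this down might be to check every column of the platform table, not just SPOT_ID, for overlap with the probe names, and then list the probes that remain unmatched. The sketch below uses small made-up vectors in place of `cgh.data$genes$ProbeName` and `Table(gpl887)`; the probe identifiers and the `NAME` column are purely illustrative assumptions, not the real contents of GPL887.

```r
# Toy stand-ins for cgh.data$genes$ProbeName and Table(gpl887).
# All values here are invented for illustration; the NAME column is hypothetical.
probe.names <- c("A_14_P100000", "A_14_P100001", "A_14_P100002", "A_14_P100003")
gpl.table <- data.frame(
  ID      = c("1", "2", "3", "4"),
  SPOT_ID = c("A_14_P100000", "A_14_P100001", NA, NA),
  NAME    = c("A_14_P100000", "A_14_P100001", "A_14_P100002", "A_14_P100003"),
  stringsAsFactors = FALSE
)

# Fraction of probe names found in each column of the platform table.
match.rate <- sapply(gpl.table, function(column) {
  mean(probe.names %in% as.character(column))
})
print(match.rate)

# Probes that have no counterpart in SPOT_ID, for manual inspection.
unmatched <- probe.names[!probe.names %in% gpl.table$SPOT_ID]
print(head(unmatched))
```

With the real objects, a column with a match rate near 1 would be the better join key, and looking at `unmatched` should show whether the missing probes are control features or genuinely absent from the platform annotation.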
I've also tried another GSE that has the same platform, with the same results.
Loading the GSE directly with getGEO also fails, which may point to the same problem:
    > data.melanoma <- getGEO("GSE7606", destdir=datadir)
    Found 1 file(s)
    GSE7606_series_matrix.txt.gz
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 9709k  100 9709k    0     0  16.9M      0 --:--:-- --:--:-- --:--:-- 17.9M
    File stored at:
    /tmp/Rtmp8NZ4Fj/GPL887.soft
    Error in validObject(.Object) :
      invalid class “ExpressionSet” object: featureNames differ between assayData and featureData
What am I missing? How can I get the proper chromosomal coordinates for the probes on this chip?
    > sessionInfo()
    R version 2.14.1 (2011-12-22)
    Platform: x86_64-redhat-linux-gnu (64-bit)

    locale:
    LC_CTYPE=en_US.utf8        LC_NUMERIC=C
    LC_TIME=en_US.utf8         LC_COLLATE=en_US.utf8
    LC_MONETARY=en_US.utf8     LC_MESSAGES=en_US.utf8
    LC_PAPER=C                 LC_NAME=C
    LC_ADDRESS=C               LC_TELEPHONE=C
    LC_MEASUREMENT=en_US.utf8  LC_IDENTIFICATION=C

    attached base packages:
    stats  graphics  grDevices  utils  datasets  methods  base

    other attached packages:
    sva_3.0.3          mgcv_1.7-13            corpcor_1.6.2
    DAVIDQuery_1.14.0  RCurl_1.91-1           bitops_1.0-4.1
    GOstats_2.20.0     Category_2.20.0        GEOquery_2.21.9
    topGO_2.6.0        SparseM_0.96           GO.db_2.6.1
    graph_1.32.0       hgu133a2.db_2.6.3      org.Hs.eg.db_2.6.4
    RSQLite_0.11.1     DBI_0.2-5              limma_3.10.3
    annotate_1.32.3    AnnotationDbi_1.16.19  gcrma_2.26.0
    affy_1.32.1        Biobase_2.14.0         ggplot2_0.9.0
    reshape_0.8.4      plyr_1.7.1             ProjectTemplate_0.3-5
    testthat_0.6

    loaded via a namespace (and not attached):
    affyio_1.22.0     BiocInstaller_1.2.1  Biostrings_2.22.0
    colorspace_1.1-1  dichromat_1.2-4      digest_0.5.2
    evaluate_0.4.1    genefilter_1.36.0    grid_2.14.1
    GSEABase_1.16.1   IRanges_1.12.6       lattice_0.20-6
    MASS_7.3-17       Matrix_1.0-4         memoise_0.1
    munsell_0.3       nlme_3.1-103         preprocessCore_1.16.0
    proto_0.3-9.2     RBGL_1.30.1          RColorBrewer_1.0-5
    reshape2_1.2.1    scales_0.2.0         splines_2.14.1
    stringr_0.6       survival_2.36-12     tools_2.14.1
    XML_3.9-4         xtable_1.7-0         zlibbioc_1.0.1