getGEO not returning correct sample names
0
0
Entering edit mode
3.6 years ago
bjwiley23 ▴ 20

Hi,

Normally when I use getGEO for microarray experiment, the sampleNames() defaults to the "GSM**" name but not for GEO="GSE6691". It's defaulting the sampleNames to the data for probe 201088_at. Not sure why this is so. It works perfectly fine for other GEO IDs such as "GSE19147". Why is this? Here is code and output. There is also supposed to be 22283 features but I only get 21667. I get same parsing errors for "GSE19147" but I get correct sampleNames and correct 22283 features.

Input:

my.gse2 <- "GSE6691"
my.geo.gse2 <- getGEO(GEO=my.gse2,
                      destdir="./geo_downloads",
                      getGPL=T)

Output:

Parsed with column specification:
cols(
  .default = col_double(),
  `201088_at` = col_character()
)
See spec(...) for full column specifications.
Using locally cached version of GPL96 found here:
./geo_downloads/GPL96.soft 
Warning: 68 parsing failures.
  row     col           expected    actual         file
22216 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
22217 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
22218 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
22219 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
22220 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
..... ....... .................. ......... ............
See problems(...) for more details.

Input:

my.geo.gse2
sampleNames(my.geo.gse2)

Output:

$GSE6691_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 21667 features, 56 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: 5.82592 6.799907 ... 5.8102 (56 total)
  varLabels: title geo_accession ... data_row_count (52 total)
  varMetadata: labelDescription
featureData
  featureNames: 201089_at 201090_x_at ... AFFX-TrpnX-M_at (21667 total)
  fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 17252022 
Annotation: GPL96


[1] "5.82592"  "6.799907" "5.388776" "6.630125" "7.093243" "6.426071" "6.482608" "6.363402"
 [9] "5.429877" "6.561914" "7.302787" "7.182047" "6.25908"  "8.237549" "6.504661" "6.386279"
[17] "6.709859" "5.855908" "6.686627" "6.401285" "6.066935" "7.919946" "6.703274" "7.679963"
[25] "8.246524" "7.534912" "7.483763" "7.917386" "5.945879" "5.479577" "5.179641" "5.507846"
[33] "5.033364" "7.497475" "8.670058" "6.941275" "7.268032" "8.177226" "7.658442" "6.746167"
[41] "5.127423" "7.357877" "7.106516" "6.029429" "6.865016" "4.837314" "6.368258" "6.591957"
[49] "5.837441" "6.472433" "6.593988" "7.668101" "5.39544"  "4.960329" "6.512131" "5.8102"
getGEO R • 1.1k views
ADD COMMENT
0
Entering edit mode

Well at least I know where the problem is but as to why I am still concerned. That probe 201088_at is at line 616 which is the exact amount of probes missing from the getGEO() ExpressionSet. It is starting at line 616 of the data using the data from that line as the "header" or sample names. Where would I report something like this?

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

FWIW it seems this might be an unexpected skip behavior in readr library's read_tsv for this particular GSE. Which might be worth reporting at the places recommended in the link above. Presuming you have downloaded and unzipped the series matrix file, using the traditional read.delim in place of read_tsv:

fname <- "GSE6691_series_matrix.txt"
dat <- read_lines(fname)
series_table_begin_line = grep("^!series_matrix_table_begin", dat)
datamat <- read.delim(fname, skip=series_table_begin_line, quote='"', comment.char="!")
dim(datamat)
datamat[1:3,1:3]

shows the expected result, a 22283x57 matrix.

ADD REPLY

Login before adding your answer.

Traffic: 1505 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6