Querying NCBI GEO by Platform ID
2
4
13.3 years ago
Andrew Su 4.9k

I want to query NCBI's GEO to get all gene expression series (GSExxxx) or GEO Data Sets (GDSxxxx) corresponding to a given platform (e.g., GPL570, which corresponds to Affy's U133 Plus 2.0 chip). I could screen-scrape the GPL570 page (specifically the section that notes there are 1683 "Series" on this platform), but I'd rather not.

I've checked out the page on programmatic access to GEO, but I can't find a query that will work. Any E-Util whizzes know how this would be done?


EDIT: Aargh, after further review, I realize I need GEO Data Sets (GDSxxxx), not the GSEs. Following Neil's answer below for GSEs, I've scanned the Meta(gpl) list but can't find the analogous entry for GDSs. Anyone know how I can do this?

In this case I can use an eUtils query for GPL570, but since my downstream analysis is in R I'd much rather use GEOquery...

ncbi geo eutils gene • 6.4k views
1

I don't think you can get to GDS using GPL in GEOquery; you'll have to fall back on EUtils. You can use GPL570[ACCN] in your query.
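For anyone who wants to try the E-utilities route from R, here is a minimal sketch in base R. It assumes the Entrez database name `gds` and the `[ETYP]` (entry type) field tag alongside `[ACCN]`; the function names `geo_esearch_url` and `extract_ids` are made up for this example. Note that esearch returns Entrez UIDs rather than accessions, so a follow-up esummary call would be needed to map them to GDSxxxx names.

```r
# Build an esearch URL for GEO records attached to a platform.
# Assumptions: db = "gds", and the [ACCN] / [ETYP] field tags.
geo_esearch_url <- function(platform, entry_type = "gds", retmax = 5000L) {
  term <- sprintf("%s[ACCN] AND %s[ETYP]", platform, entry_type)
  paste0(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    "?db=gds",
    "&term=", utils::URLencode(term, reserved = TRUE),
    "&retmax=", sprintf("%d", as.integer(retmax))
  )
}

# Pull the <Id> values out of an esearch XML response (simple regex, no XML package).
extract_ids <- function(xml) {
  regmatches(xml, gregexpr("(?<=<Id>)[0-9]+(?=</Id>)", xml, perl = TRUE))[[1]]
}

url <- geo_esearch_url("GPL570")

# Fetching needs network access, so it is left commented out here:
# xml <- paste(readLines(url, warn = FALSE), collapse = "")
# ids <- extract_ids(xml)
```

Passing `entry_type = "gse"` should give the series instead of the data sets.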

6
13.3 years ago
Neilfws 49k

I'd probably use GEOquery from Bioconductor for this (it uses EUtils queries under the hood).

Brief example:

library(GEOquery)
gpl <- getGEO("GPL570") # may take a long time
length(Meta(gpl)$series_id)
# => 1683
# the first 10 series
head(Meta(gpl)$series_id, n = 10L)
#  [1] "GSE1145" "GSE1643" "GSE2109" "GSE2125" "GSE2298" "GSE2327" "GSE2328"
#  [8] "GSE2397" "GSE2435" "GSE2478"
1

It's not the best documentation: I believe earlier versions had better examples. That said, the getGEO() and Meta() methods are in there. I spent quite some time studying the structure of GPL, GSE and GSM files, which helped. They are all quite similar, with two parts: metadata, accessed via Meta(), and a data table, accessed via Table(). I also found BioRuby's Bio::SOFT library useful - http://bioruby.org/rdoc/classes/Bio/SOFT.html.
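To make the two-part structure concrete, here is a rough base-R sketch that splits a SOFT-style record into metadata and a data table. The fragment is hand-written for illustration (the probe and gene names are placeholders, not real GEO output), `parse_soft` is a hypothetical helper, and this is not how GEOquery itself parses files; it only loosely mirrors what Meta() and Table() give you.

```r
# Illustrative SOFT fragment (hand-written, not real GEO output).
soft <- c(
  "^PLATFORM = GPL570",
  "!Platform_series_id = GSE1145",
  "!Platform_series_id = GSE1643",
  "!platform_table_begin",
  "ID\tGene Symbol",
  "probe_1\tGENE_A",
  "probe_2\tGENE_B",
  "!platform_table_end"
)

# Split a SOFT record into metadata ("!key = value" lines) and its data table.
parse_soft <- function(lines) {
  begin <- grep("_table_begin$", lines)
  end   <- grep("_table_end$", lines)
  meta_lines <- grep("^![^=]+ = ", lines, value = TRUE)
  keys   <- sub("^!([^ =]+) = .*$", "\\1", meta_lines)
  values <- sub("^![^=]+ = ", "", meta_lines)
  # Repeated keys (such as series IDs) collapse into one character vector.
  meta <- split(values, factor(keys, levels = unique(keys)))
  table <- utils::read.delim(
    text = lines[(begin + 1):(end - 1)],
    stringsAsFactors = FALSE, check.names = FALSE
  )
  list(meta = meta, table = table)
}

rec <- parse_soft(soft)
rec$meta$Platform_series_id  # series attached to the platform
head(rec$table)              # the annotation table
```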

1

Oh, just a quick FYI. GEOquery does not use eUtils under the hood. The queries go directly against NCBI GEO and range from reading FTP directory listings (to get file listings) to HTTP GET requests to FTP downloads.

0

+1 Code works for me. The getGEO call took 97 seconds.

0

Considering I was already using GEOquery, perfect. One naive question -- how would I have found this using the online documentation? I looked through both the vignette and the manual, but the answer is still not jumping out at me...

0

Thanks for the feedback on documentation. I added this example to the vignette as a use case. I do assume way too much knowledge of the structure of GEO records and need to continue to improve the documentation to that end.

2
13.3 years ago

Another option is GEOmetadb, a Bioconductor package built around a SQLite database. The tradeoff is that you need to pull all the metadata down to your local system, but once you do, you can run lots of fast queries. If you are handy with SQL, you can use the database by itself or from another environment.

0

Just a note here. The "pulling all the metadata to your local system" is a one-line command and is only about a 50MB download (compressed). The data are in the form of a SQLite file that can, of course, be accessed from any language with a SQLite API, so the GEOmetadb R package is not a prerequisite.
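A sketch of what the GEOmetadb route might look like for the original question. It assumes the schema has a `gds` table with `gds`, `title` and `gpl` columns and a `gse_gpl` link table with `gse` and `gpl` columns, so check the table listing in your copy of the database before relying on it. `records_for_platform` is a hypothetical helper, and the download step is commented out because it needs network access and the GEOmetadb, DBI and RSQLite packages.

```r
# Parameterised queries; table and column names are assumptions about the schema.
gds_sql <- "SELECT gds, title FROM gds WHERE gpl = ?"
gse_sql <- "SELECT gse FROM gse_gpl WHERE gpl = ?"

# Run one of the queries above for a given platform accession.
records_for_platform <- function(con, sql, platform) {
  DBI::dbGetQuery(con, sql, params = list(platform))
}

# Typical use:
# sqlfile <- GEOmetadb::getSQLiteFile()   # the one-line download
# con <- DBI::dbConnect(RSQLite::SQLite(), sqlfile)
# gds <- records_for_platform(con, gds_sql, "GPL570")
# gse <- records_for_platform(con, gse_sql, "GPL570")
# DBI::dbDisconnect(con)
```

Since the file is plain SQLite, the same two SELECT statements should work from any other language with a SQLite driver.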
