Data retrieval from GEO
7.3 years ago
lessismore ★ 1.4k

Dear all,

I have an extremely long list of unannotated CEL files and I would like to fetch metadata from GEO in bulk. Any advice for that?

Thanks in advance!

Microarray GEO • 4.2k views

Solved: It's quite tricky

e.g. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=sra&id=010170,..

First you extract the experiment ID associated with the GSM ID, then you can redo the search to extract the values for the specific term you are interested in. If anyone has a faster solution, it would be very welcome.


Could you post the GSE ID? I'll try to get back to you.


I don't have the GSE, just the GSM IDs.


Could you please give me a couple of those?


Hey, let's work with GSM85508 for example.

Open https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM85508 and scroll down. You will find a label named "Series", beside which the GSE ID is listed. Click on the GSE ID, and at the bottom of that page there will be a link to a compressed file. The compressed file will contain the .cel files.


Thanks, but I'm interested in the bulk metadata download. It should be possible with this, but it's not working:

esummary.fcgi?db=database&id=uid1,uid2,uid3,...

7.3 years ago
James Ashmore ★ 3.5k

You can use the Entrez utilities to get the information you need:

$ esearch -db gds -query GSM85508 | efetch

1. Basal-like breast cancer tumors
Analysis of sporadic basal-like cancer (BLC), BRCA-associated breast cancer, and non-BLC tumors. Sporadic BLC are phenotypically similar to BRCA1-associated cancers. Results provide insight into the molecular pathogenesis of BLC and BRCA1-associated breast cancer.
Organism:   Homo sapiens
Type:       Expression profiling by array, transformed count, 4 disease state sets
Platform: GPL570 Series: GSE3744 47 Samples
FTP download: GEO (CEL) ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS2nnn/GDS2250/
DataSet     Accession: GDS2250  ID: 2250

2. Human breast tumor expression
(Submitter supplied) Gene expression for 47 human breast tumor cases; (* normalized by GCRMA for global expression analysis) Keywords: Type
Organism:   Homo sapiens
Type:       Expression profiling by array
Dataset: GDS2250 Platform: GPL570 47 Samples
FTP download: GEO (CEL) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE3nnn/GSE3744/
Series      Accession: GSE3744  ID: 200003744

3. [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array
(Submitter supplied) Affymetrix submissions are typically submitted to GEO using the GEOarchive method described at http://www.ncbi.nlm.nih.gov/projects/geo/info/geo_affy.html  June 03, 2009: annotation table updated with netaffx build 28 June 06, 2012: annotation table updated with netaffx build 32 June 23, 2016: annotation table updated with netaffx build 35 Protocol: see manufacturer's web site  Complete coverage of the Human Genome U133 Set plus 6,500 additional genes for analysis of over 47,000 transcripts All probe sets represented on the GeneChip Human Genome U133 Set are identically replicated on the GeneChip Human Genome U133 Plus 2.0 Array. more...
Organism:   Homo sapiens
602 DataSets 4699 Series 58 Related Platforms 133119 Samples
FTP download: GEO ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/GPLnnn/GPL570/
Platform    Accession: GPL570   ID: 100000570

4. T92 U133p2
Organism:   Homo sapiens
Source name:    T92
Platform: GPL570 Series: GSE3744 Dataset: GDS2250
FTP download: GEO (CEL) ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM85nnn/GSM85508/
Sample      Accession: GSM85508 ID: 300085508

To do a batch search you'll need to write a script to read each of your GSM accessions and send a query to NCBI with the above command.
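As a rough sketch of such a script, the loop below reads one GSM accession per line and runs the command above for each. It assumes Entrez Direct (esearch, efetch) is installed and on your PATH. The function name `fetch_gsm_summaries`, the `GEO_DELAY` variable, and the file name `gsm_ids.txt` are made up for illustration. The pause between queries is there because NCBI limits request rates for clients without an API key.

```shell
# Print the GEO record summary for every GSM accession listed in a file.
# $1: text file with one accession per line (blank lines are skipped).
fetch_gsm_summaries() {
    while IFS= read -r gsm; do
        # Skip blank lines
        [ -n "$gsm" ] || continue
        # Header line so each block of output can be traced back to its query
        printf '### %s\n' "$gsm"
        # stdin is redirected so esearch cannot consume the rest of the accession list
        esearch -db gds -query "$gsm" < /dev/null | efetch
        # Pause between queries (seconds); override with GEO_DELAY if needed
        sleep "${GEO_DELAY:-1}"
    done < "$1"
}

# Example call (gsm_ids.txt is a hypothetical file name):
# fetch_gsm_summaries gsm_ids.txt > geo_summaries.txt
```

Each query returns every record linked to the accession (DataSet, Series, Platform, Sample), so the output for a long list can get large.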


Thanks, very useful. What if, for each GSM, I would like to retrieve only the GSE? Could you comment on this?


I guess you could just use the grep command line tool to retrieve the GSE numbers:

grep "Series:"
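To make that concrete, here is a small sketch that reduces the efetch output to just the series accession(s). It assumes the output has the same layout as in the answer above, where the GSE appears on lines containing "Series:". The function name `extract_gse` is hypothetical.

```shell
# Read efetch text output on stdin and print the unique GSE accessions
# found on lines that contain "Series:".
extract_gse() {
    grep 'Series:' | grep -o 'GSE[0-9][0-9]*' | sort -u
}

# Example call (requires Entrez Direct):
# esearch -db gds -query GSM85508 | efetch | extract_gse
```

The `sort -u` is there because the same GSE shows up in more than one record of the output (for instance in both the DataSet and the Sample record).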

Thanks. What I'm looking for is probably the esummary tool, but I tried it with 2 IDs in the sra database and it doesn't work.

esummary.fcgi?db=database&id=uid1,uid2,uid3,...
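One possible reason this fails: esummary expects numeric UIDs from the database you name, not accessions, and the GEO records shown above come from db=gds rather than sra. The sketch below builds a single esummary URL for several GSM accessions. It assumes the UID pattern visible in the output above (GSM85508 has ID 300085508, i.e. "3" followed by the number zero-padded to 8 digits), which I have not checked for every sample. The function names `gsm_to_uid` and `esummary_url` are made up, and GSM85509 is only a placeholder second accession.

```shell
# Convert a GSM accession to its numeric GEO UID.
# ASSUMPTION: UID = "3" + accession number zero-padded to 8 digits,
# as in GSM85508 -> 300085508.
gsm_to_uid() {
    awk -v n="${1#GSM}" 'BEGIN { printf "3%08d\n", n }'
}

# Build one esummary URL covering all GSM accessions given as arguments.
esummary_url() {
    ids=""
    for gsm in "$@"; do
        uid=$(gsm_to_uid "$gsm")
        # Join UIDs with commas, without a leading comma
        ids="${ids:+$ids,}$uid"
    done
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gds&id=%s\n' "$ids"
}

# Example call:
# curl -s "$(esummary_url GSM85508 GSM85509)"
```

If the pattern does not hold for some of your samples, running esearch against gds first and taking the IDs it returns would be the safer route.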