Getting sample information from GEO
6.9 years ago
Tom_L ▴ 350

Hello,

Is it possible to get sample information from GEO (Gene Expression Omnibus) so that I can write a script to manage a batch of samples?

I'm currently working on a microarray dataset (GSE59150) containing 873 samples. For each sample, I need to know the gender and ethnicity (listed in the "Characteristics" section). See this example: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1442240

How can I deal with this? I did not see any query interface for interrogating sample information. In addition, I cannot simply wget the web page and parse the source code because of the "?acc=GSM1442240" part of the URL. Finally, I did not find a clinical spreadsheet available on GEO or provided by the authors in their paper.

Thank you in advance.

Cheers.

GEO NCBI • 5.7k views

If you have all the GSM IDs, can't you iterate through them in GEOquery?
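For what it's worth, a minimal sketch of that iteration, assuming GEOquery is installed and the GSM IDs are already at hand (the three accessions below are just the first few from GSE59150; note that each getGEO() call downloads that sample's record, which, as discussed further down, adds up for 873 samples):

    library(GEOquery)

    # A few GSM accessions from GSE59150; in practice, read the full list from a file.
    gsm_ids <- c("GSM1442240", "GSM1442241", "GSM1442242")

    # getGEO() on a GSM accession returns a GSM object; Meta() exposes its header,
    # and characteristics_ch1 holds the "Characteristics" fields (ethnicity, etc.).
    info <- lapply(gsm_ids, function(id) {
      chars <- Meta(getGEO(id))$characteristics_ch1
      data.frame(accession = id,
                 ethnicity = sub("^ethnicity: ", "",
                                 grep("^ethnicity:", chars, value = TRUE)))
    })
    do.call(rbind, info)

Other fields (e.g. gender) can be pulled the same way by grepping for their label in characteristics_ch1.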

6.9 years ago
Tom_L ▴ 350

I'm answering my own question since I found a convenient way to do this.

There are two problems with the GEOquery package from Bioconductor. First, GEOquery requires downloading the whole dataset again (unless I missed an option to fetch only the sample information?), and the raw dataset is nearly 100 GB; I had already downloaded the complete dataset, processed it and deleted it because of its size. Second, the SOFT-formatted file does contain the sample information, but GEOquery took ages to load a ~36 GB file (which I had to download as well). I guess that if the dataset were smaller, GEOquery would have been a convenient tool for this, but it is not a viable option in my case.

What I did instead: a basic UNIX grep on the SOFT-formatted file. At some point (after the microarray platform definition), the sample information is listed. I matched the relevant patterns to get what I wanted:

zgrep -P "^(\^SAMPLE = GSM|\!Sample_characteristics_ch1 = ethnicity:)" GSE59150_family.soft.gz | paste - - | sed -r 's/(\^SAMPLE = |\!Sample_characteristics_ch1 = ethnicity: )//g'

Basically, this command captures two lines per sample: the sample name (starting with ^SAMPLE) and the sample ethnicity (starting with !Sample_characteristics_ch1). paste merges the two consecutive lines into one, and sed strips the prefixes. Output (tab-delimited):

GSM1442240	European
GSM1442241	European
GSM1442242	European
...

Hope this will help someone, someday.

Cheers.


Sorry, I didn't realise getGEO("GSM1442240") led to the whole dataset being downloaded. I thought it would pull out the sample-specific meta-info.


No problem. GEOquery actually does the job, but it requires downloading the SOFT files first; it does not query GEO directly. Moreover, it downloads a ~100 MB file per sample if you process them individually: 873 samples at 100 MB each is about 87 GB, much more than the 36 GB file covering the whole dataset that is available from GEO. This is because the microarray platform definition is repeated for each sample when they are processed one by one.
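(For anyone landing here later: a hedged sketch of the series-level route in GEOquery, which goes through the series matrix file instead of per-sample downloads. The arguments below reflect my understanding of getGEO(); for a series with 873 samples the matrix file itself can still be sizeable.)

    library(GEOquery)

    # GSEMatrix = TRUE (the default) downloads the series matrix file(s) rather
    # than the SOFT family file; getGPL = FALSE skips the platform annotation.
    gse <- getGEO("GSE59150", GSEMatrix = TRUE, getGPL = FALSE, destdir = ".")

    # The result is a list of ExpressionSet objects (one per platform); the sample
    # annotations, including the characteristics fields, live in pData().
    pheno <- pData(gse[[1]])
    head(pheno[, grep("characteristics", colnames(pheno)), drop = FALSE])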


Go ahead and accept your own answer (green check mark) to provide closure to this thread.


I think this goes back to @Istvan's comment below.

6.9 years ago
GenoMax 141k

I am sure there is a clever way to use NCBI eUtils to get this information, and no doubt someone will post it soon. I am going to give it a try in the meantime.

The list of sample IDs is available in this file.


I have asked the NCBI folks about this directly, and through channels that I think ought to reach the developers. The lack of response makes me believe that the Entrez Direct tools are not suited to reaching into the content of these files, as the fields are not represented independently in the database. I think they are stored as blobs of text, hence the tools are unable to query them.

The GEOquery approach noted by russhh is probably the right one.


I find the eUtils documentation (and in-line help) lacking in clarity in general. There is also no explicit mention of which combinations of esearch/efetch and databases make sense or are allowed.

3.8 years ago

Try the GEO data browser: you can change the series number at the end of the URL to your own GSE ID. Only some of the columns are shown there. Hope you find what you need! https://www.ncbi.nlm.nih.gov/geo/browse/?view=samples&series=68379 The view shows the following columns: Accession, Title, Sample type, Organism(s), Ch, Platform, Series, Supplementary, Contact, Release date.

4 weeks ago
Tania • 0

This is an old question, but it ranked highly on Google when I had the same question, so here is my answer, starting from the sample page in your example link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1442240

Scroll down to "Series" and click on the NCBI GEO accession for the whole project that contains this sample; in this example it is "GSE59150", which takes you to: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE59150

From that "Series" page, scroll down to section "Download family" and click on "Series Matrix File(s)" which takes you here: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE59nnn/GSE59150/matrix/

Download the _matrix.txt.gz file, here called "GSE59150_series_matrix.txt.gz". This is what you want!

If you're an R programmer, you can load the matrix directly with the read.table() function, using "skip=" to skip however many header rows you need. However, I think it's easier to just look at the file in a spreadsheet reader like Excel instead of guessing how many rows to skip. For Windows users, decompress the gzip file with freeware like 7-Zip or PeaZip to get the .txt file, which you can right-click and open with Excel.

You may need to do a bit more data cleanup, such as changing cell values from "age: 34" to "34" (find and replace "age: " with ""), then converting to numeric values so you can actually calculate medians and averages, but that cleanup will be project-specific.

The matrix file contains a lot of sample data after the metadata rows. For microarray projects I didn't have issues, but if the file is still too large, you can use R's read.table() function with nrows = 100 to load only the first 100 rows, then write that to a .csv file and open that truncated file in Excel instead. It will be smaller.
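If you prefer to stay in R end to end, here is a small sketch along those lines, assuming the series matrix file has been downloaded to the working directory; the line count (n = 200) is a placeholder and the "!Sample_" prefix reflects the usual series matrix layout, so adjust as needed:

    # Read only the header of the series matrix: metadata rows start with "!",
    # and the per-sample annotations start with "!Sample_".
    con <- gzfile("GSE59150_series_matrix.txt.gz", "rt")
    hdr <- readLines(con, n = 200)   # the metadata block sits in the first couple hundred lines
    close(con)

    meta <- grep("^!Sample_", hdr, value = TRUE)

    # Each metadata row is tab-delimited: a label followed by one (quoted) value per sample.
    fields <- strsplit(meta, "\t")
    labels <- sub("^!Sample_", "", vapply(fields, `[`, "", 1))
    values <- lapply(fields, function(x) gsub("\"", "", x[-1]))

    # Assemble a samples-by-fields data frame; duplicate labels (e.g. several
    # characteristics_ch1 rows) are made unique.
    pheno <- as.data.frame(setNames(values, make.unique(labels)), check.names = FALSE)
    head(pheno[, grep("characteristics", names(pheno)), drop = FALSE])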
