Question: Getting sample information from GEO
Tom_L320 wrote:

Hello,

Is it possible to get sample information from GEO (Gene Expression Omnibus) in a way that lets me write a script to handle a batch of samples?

I'm currently working on a micro-array dataset (GSE59150) containing 873 samples. For each sample, I need to know the gender and ethnicity (listed in the "Characteristics" section). See an example here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1442240

How can I do that? I did not find any way to query sample information. In addition, I cannot wget the web page and look into the source code because of the "?acc=GSM1442240" part in the URL. Finally, I did not find a clinical spreadsheet, either on GEO or provided by the authors in their paper.

Thank you in advance.

Cheers.

ncbi geo • 1.6k views
modified 2.3 years ago • written 2.3 years ago by Tom_L320

If you have all the GSM IDs, can't you iterate through them in GEOquery?

written 2.3 years ago by russhh4.7k
Tom_L320 wrote:

I am answering my own question since I found a convenient way to do this.

There are two problems with the GEOquery package from Bioconductor. First, GEOquery requires downloading the whole dataset again (unless I missed an option to get only the sample information?), and the raw dataset is nearly 100 GB. I had already downloaded the complete dataset, processed it, and deleted it because of its size. Second, I understood that the SOFT-formatted file contains the sample information, but GEOquery took ages to load a ~36 GB file (I had to download that one too). I guess GEOquery would be a convenient tool for a smaller dataset, but it does not seem viable in my case.

What I did: a basic UNIX grep command on the SOFT-formatted file. The sample information appears at some point after the micro-array format definition. I matched its pattern to get what I wanted:

zgrep -P "^(\^SAMPLE = GSM|\!Sample_characteristics_ch1 = ethnicity:)" GSE59150_family.soft.gz | paste - - | sed -r 's/(\^SAMPLE = |\!Sample_characteristics_ch1 = ethnicity: )//g'

Basically, this command captures two lines per sample: the sample name (starts with ^SAMPLE) and the sample ethnicity (starts with !Sample_characteristics_ch1). paste merges each pair of consecutive lines into a single line, and sed strips the matched prefixes. Output (tab-delimited):

GSM1442240 European
GSM1442241 European
GSM1442242 European
...
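For anyone who wants several characteristics at once (for example both gender and ethnicity), here is a minimal Python sketch of the same idea. It streams the gzipped SOFT file line by line, so the ~36 GB file is never loaded into memory. The function names are hypothetical, and the keys "gender" and "ethnicity" are assumptions about how the characteristics are labelled in this dataset; check a few samples before relying on them.

```python
import gzip
from typing import Dict, Iterable, Iterator, Tuple

SAMPLE_PREFIX = "^SAMPLE = "
CHAR_PREFIX = "!Sample_characteristics_ch1 = "


def iter_sample_characteristics(
    lines: Iterable[str],
) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (sample_id, {key: value}) for each ^SAMPLE block in a SOFT stream."""
    sample_id = None
    fields: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(SAMPLE_PREFIX):
            # A new sample starts: emit the previous one, if any.
            if sample_id is not None:
                yield sample_id, fields
            sample_id = line[len(SAMPLE_PREFIX):].strip()
            fields = {}
        elif sample_id is not None and line.startswith(CHAR_PREFIX):
            # Lines look like "!Sample_characteristics_ch1 = ethnicity: European".
            key, sep, value = line[len(CHAR_PREFIX):].partition(":")
            if sep:
                fields[key.strip().lower()] = value.strip()
    if sample_id is not None:
        yield sample_id, fields


def write_table(soft_gz_path: str, keys=("gender", "ethnicity")) -> None:
    """Print one tab-delimited row per sample; missing keys become 'NA'."""
    with gzip.open(soft_gz_path, "rt", encoding="utf-8", errors="replace") as handle:
        for sample_id, fields in iter_sample_characteristics(handle):
            print(sample_id, *(fields.get(k, "NA") for k in keys), sep="\t")
```

Calling write_table("GSE59150_family.soft.gz") should print one row per sample. Unlike the paste-based one-liner, this does not assume exactly one matching characteristics line per sample, so a sample with a missing field gets "NA" instead of shifting the columns.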

Hope this will help someone, someday.

Cheers.

written 2.3 years ago by Tom_L320

Sorry, I didn't realise getGEO("GSM1442240") led to the whole dataset being downloaded. I thought it would pull out only the sample-specific meta-info.

modified 2.3 years ago • written 2.3 years ago by russhh4.7k

No problem. GEOquery actually does the job, but it requires downloading the SOFT files first; it does not query GEO directly. Moreover, it downloads a 100 MB file per sample if you process them individually. 873 samples at 100 MB each is about 87 GB, much more than the 36 GB file for the whole dataset available from GEO. This is because the micro-array format is repeated for each sample when you process them one by one.
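A possible way around the per-sample download size, sketched in Python under some assumptions: as far as I know, GEO's acc.cgi page accepts targ, form and view parameters, and form=text with view=brief should return only the SOFT header of a record, without the data table. The function names and the half-second delay between requests are my own choices and not anything GEO prescribes; please check GEO's programmatic access notes before running this over all 873 samples.

```python
import time
import urllib.parse
import urllib.request
from typing import Dict, Iterable, Iterator, Tuple

GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"
CHAR_PREFIX = "!Sample_characteristics_ch1 = "


def brief_soft_url(accession: str) -> str:
    """Build the URL for the brief, text-form SOFT record of one accession."""
    query = urllib.parse.urlencode(
        {"acc": accession, "targ": "self", "form": "text", "view": "brief"}
    )
    return f"{GEO_ACC_URL}?{query}"


def parse_characteristics(soft_text: str) -> Dict[str, str]:
    """Collect 'key: value' pairs from the characteristics lines of one record."""
    fields: Dict[str, str] = {}
    for line in soft_text.splitlines():
        if line.startswith(CHAR_PREFIX):
            key, sep, value = line[len(CHAR_PREFIX):].partition(":")
            if sep:
                fields[key.strip().lower()] = value.strip()
    return fields


def fetch_characteristics(
    accessions: Iterable[str], delay_s: float = 0.5
) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Fetch each sample's brief record in turn, pausing between requests."""
    for acc in accessions:
        with urllib.request.urlopen(brief_soft_url(acc), timeout=30) as response:
            text = response.read().decode("utf-8", errors="replace")
        yield acc, parse_characteristics(text)
        time.sleep(delay_s)  # assumed polite delay, not an official rate limit
```

For example, looping over fetch_characteristics(["GSM1442240", "GSM1442241"]) would yield one dictionary of characteristics per sample. With wget, the same URL should work as long as it is quoted, since the shell otherwise interprets the "?" and "&" characters.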

written 2.3 years ago by Tom_L320

Go ahead and accept your own answer (green check mark) to provide closure to this thread.

written 2.3 years ago by genomax71k

I think this goes back to @Istvan's comment below.

modified 2.3 years ago • written 2.3 years ago by genomax71k
genomax71k wrote:

I am sure there is a clever way to use NCBI eUtils to get this information. I am sure someone will post it soon. I am going to give it a try.

The list of sample IDs is available in this file.

modified 2.3 years ago • written 2.3 years ago by genomax71k

I have asked the NCBI folks about this, both directly and via channels that I think ought to reach the developers. The lack of response makes me believe that the Entrez Direct tools are not suited for reaching into the content of these files, as these are not represented independently in the database. I think they are stored as blobs of text, so the tools are unable to query them.

The GEOquery approach noted by russhh is probably the right one.

written 2.3 years ago by Istvan Albert ♦♦ 81k

I find the eUtils documentation (and in-line help) lacking in clarity in general. There is also no explicit mention of which combinations of esearch/efetch and databases make sense or are allowed.

modified 2.3 years ago • written 2.3 years ago by genomax71k