Question: Retrieving age metadata for GEO samples
10 months ago, EverInEarnest wrote:

I have a set of ~2,000 Gene Expression Omnibus (GEO) sample names in the format GSM1234567. For each sample, I need to obtain the age-related metadata, which consists of a value or range of values (e.g. "5" or "18-24") and a unit of time (e.g. "months" or "years"). I found a helpful thread, "Data retrieval from GEO", that describes a few strategies for obtaining GEO metadata. However, while those strategies are generally useful, I don't believe they will let me obtain the age-specific metadata I need.

The GEOmetadb R package sounded promising, but it requires downloading the entire GEOmetadb database, and I'm concerned that this file might be huge. Since I only need metadata for ~2,000 samples, downloading the information for every GEO sample seems very inefficient and likely to cause memory problems on my system.

The GEOquery R package (http://bioconductor.org/packages/release/bioc/html/GEOquery.html) in Bioconductor is very helpful. I can use the getGEO() function to extract the data for a selected sample, and the Meta() function to report the metadata associated with that file. However, while the GEOquery documentation shows a metadata field named “description” that has an Age sub-field, for ~10 other example samples that I have checked from my set, this “description” field isn’t present, and any (rare) age data that I can find is buried in other fields non-systematically, e.g. the “characteristics_ch1” or “extract_protocol_ch1” fields.

Is anyone aware of a consistent structure/method that would permit me to extract the Age metadata from each sample for which it is present? I am reluctant to resort to parsing the metadata to extract this info, as this seems like it will be error-prone, but will need to do this if another simpler method isn’t available.

Thanks in advance for your advice.

10 months ago, Sean Davis (National Institutes of Health, Bethesda, MD) wrote:

To my knowledge, there is no way to query GEO for age-related metadata without parsing text (and guessing the meaning). The BioSample database at NCBI is beginning to provide tag-value pairs for metadata, but GEO datasets are not yet included (though SRA data are).

You could give a new project, STARGEO, a try. However, I suspect that it will not provide you with complete information either.

https://stargeo.org

As for GEOmetadb, the file is big but not huge, and because it is a SQLite database it is queried on disk and will not overwhelm your system's memory. That said, you will still need to parse text, as we did not attempt to mine the metadata ourselves (that is probably several grants' worth of work).
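As a rough illustration of why memory should not be a problem, here is a minimal Python sketch (not part of the original answer) that pulls only the rows for a given list of GSM accessions out of a SQLite file. It assumes the database has a table named gsm with columns gsm and characteristics_ch1; check the schema of your copy before relying on those names. The function name fetch_characteristics and the chunk size are made up for this example.

```python
import sqlite3
from typing import Dict, Iterable


def fetch_characteristics(
    conn: sqlite3.Connection,
    gsm_ids: Iterable[str],
    chunk_size: int = 500,
) -> Dict[str, str]:
    """Return {GSM accession: characteristics text} for the requested samples.

    Assumes a table `gsm` with columns `gsm` and `characteristics_ch1`
    (an assumption about the GEOmetadb schema; verify it on your file).
    """
    ids = list(gsm_ids)
    found: Dict[str, str] = {}
    # Query in chunks so the number of bound parameters stays small.
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        query = (
            "SELECT gsm, characteristics_ch1 FROM gsm "
            f"WHERE gsm IN ({placeholders})"
        )
        for gsm, characteristics in conn.execute(query, chunk):
            # NULL fields come back as None; normalise them to "".
            found[gsm] = characteristics or ""
    return found
```

With a local copy of the database, usage would look something like conn = sqlite3.connect("GEOmetadb.sqlite") followed by fetch_characteristics(conn, my_gsm_ids), where the file name and my_gsm_ids are placeholders. Only the matching rows are read into memory; samples missing from the database are simply absent from the returned dictionary.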

Finally, if you do happen to have data that are clustered into a few GEO Series records, I have made some recent changes to how GEOquery parses GEO GSEMatrix records. In particular, if the age information is annotated as "age: XXX" in one of the characteristics columns, GEOquery will try to put that into a separate column in the associated sample information. You can then pull the rows from the sample information you like.
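For samples where the age is annotated in the "age: XXX" style, the text parsing itself can be fairly small. The Python sketch below (again, not part of the original answer) scans a sample's characteristics entries for a tag starting with "age" and returns the value and, when recognisable, the unit. The Age type, the parse_age function, and the unit table are all hypothetical and cover only a few common spellings; free-text ages such as "adult" are deliberately left unparsed rather than guessed.

```python
import re
from typing import Iterable, NamedTuple, Optional


class Age(NamedTuple):
    value: str            # e.g. "5" or "18-24"
    unit: Optional[str]   # e.g. "years"; None when no known unit is stated


# Small, incomplete table of unit spellings (an assumption; extend as needed).
_UNITS = {
    "y": "years", "yr": "years", "yrs": "years", "year": "years", "years": "years",
    "mo": "months", "month": "months", "months": "months",
    "w": "weeks", "wk": "weeks", "wks": "weeks", "week": "weeks", "weeks": "weeks",
    "d": "days", "day": "days", "days": "days",
}

# Matches entries such as "age: 45 years", "Age (months): 5", "age: 18-24 y".
_AGE_RE = re.compile(
    r"""^\s*age\b[^:(]*                       # tag that starts with the word "age"
        (?:\(\s*(?P<tag_unit>[a-z]+)\s*\))?   # optional unit inside the tag
        \s*:\s*
        (?P<value>\d+(?:\.\d+)?(?:\s*-\s*\d+(?:\.\d+)?)?)  # number or range
        \s*(?P<unit>[a-z]+)?                  # optional unit after the value
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_age(characteristics: Iterable[str]) -> Optional[Age]:
    """Return the first numeric age found in the entries, or None."""
    for entry in characteristics:
        match = _AGE_RE.match(entry)
        if not match:
            continue
        value = re.sub(r"\s+", "", match.group("value"))
        unit = None
        # Prefer a unit written after the value, then one inside the tag.
        for raw in (match.group("unit"), match.group("tag_unit")):
            if raw and raw.lower() in _UNITS:
                unit = _UNITS[raw.lower()]
                break
        return Age(value, unit)
    return None
```

The function expects one entry per string, e.g. parse_age(["gender: male", "age: 45 years"]). If a whole characteristics field arrives as a single string, split it into entries first; the delimiter depends on where the data came from, so check a few records by hand. Anything this returns None for still needs manual review.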


Thanks for your helpful response, Sean! I will look into STARGEO to determine whether it will be helpful. I don't believe that my ~2,000 samples are clustered into only a few GEO Series records, but I'll check out your updates to how GEOquery parses the GEO GSEMatrix records.

Reply written 10 months ago by EverInEarnest