I have a set of ~2,000 Gene Expression Omnibus (GEO) sample accessions in the format GSM1234567. For each sample, I need to obtain the age-related metadata, which will include both a value or range of values (e.g. "5" or "18-24") and a unit of time (e.g. "months" or "years"). I found the helpful thread "Data retrieval from GEO", which describes a few strategies for obtaining GEO metadata. However, while those strategies are generally useful, I don't believe they will let me obtain the age-specific metadata I need.
The GEOmetadb R package sounded promising, but it requires downloading the entire GEOmetadb database, and I'm concerned that this file might be huge. Since I only need metadata for 2,000 samples, downloading the information for every GEO sample seems very inefficient, and I worry it could cause memory problems on my system.
The GEOquery R package (http://bioconductor.org/packages/release/bioc/html/GEOquery.html) in Bioconductor is very helpful. I can use the getGEO() function to retrieve the record for a selected sample, and the Meta() function to report the metadata associated with that record. However, while the GEOquery documentation shows a metadata field named “description” that has an Age sub-field, this “description” field isn’t present in any of the ~10 example samples I have checked from my set, and any (rare) age data that I can find is buried non-systematically in other fields, e.g. “characteristics_ch1” or “extract_protocol_ch1”.
Is anyone aware of a consistent structure or method that would let me extract the Age metadata from each sample for which it is present? I am reluctant to resort to parsing the free-text metadata to extract this info, as that seems error-prone, but I will do so if no simpler method is available.
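For concreteness, here is a minimal sketch of the parsing fallback I have in mind. It is written in Python purely for illustration and works on plain strings, so it is independent of GEOquery. The helper name `extract_age`, the regular expression, and the list of units are all my own assumptions: I am guessing that age entries look like "age: 5 months", "age: 18-24 years" or "Age (years): 45", which may well not hold across all of my samples.

```python
import re

# Hypothetical pattern: NOT a GEO standard, just a guess at common layouts.
AGE_PATTERN = re.compile(
    r"\bage\b(?:\s*\((?P<key_unit>[a-z]+)\))?\s*:\s*"      # "age:" or "age (years):"
    r"(?P<value>\d+(?:\.\d+)?(?:\s*-\s*\d+(?:\.\d+)?)?)"   # "5" or "18-24"
    r"\s*(?P<unit>days?|weeks?|months?|years?)?",          # optional trailing unit
    re.IGNORECASE,
)

def extract_age(fields):
    """Return (value, unit) from the first field that looks like an age entry.

    `fields` is a list of free-text metadata strings for one sample.
    The unit is None when no recognised unit is found; the whole result
    is None when no field matches.
    """
    for field in fields:
        match = AGE_PATTERN.search(field)
        if match:
            # Prefer a unit written after the value, else one given in the key.
            unit = match.group("unit") or match.group("key_unit")
            value = match.group("value").replace(" ", "")
            return (value, unit.lower() if unit else None)
    return None

print(extract_age(["tissue: liver", "age: 18-24 years"]))
```

The input would be the strings from a field such as “characteristics_ch1” for one sample. Anything that does not match the assumed layouts (for example an abbreviated unit like "y") would come back without a unit, or not at all, which is exactly the kind of fragility I would like to avoid.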
Thanks in advance for your advice.