Get GEO microarray probe sequences by GPL ID
14 months ago
predeus ★ 1.6k

Hello all,

I was wondering if there is a straightforward way to obtain probe sequences for a microarray platform given the GPL ID. I know that there are varying degrees of annotation for microarrays in GEO - the most popular ones have "annot" files, while others have "miniml" and "soft" files. However, these are all over the place - different gene symbols, IDs, etc.

So, if you can suggest how I can get a simple "probe id - sequence" table using the GEO GPL ID, I would be most grateful.

vkkodali : If you happen to look at this thread, I would be curious to know whether there is a way to do this with Entrez Direct. I tried to hack at it a bit but couldn't make any headway.

predeus : We may have the best chance of getting an answer from the user I quoted above, so apologies for what may seem like an off-target comment.

I think that you can just search for the GPL ID at GEO and then there should be an entire annotation table to download, no?
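If you would rather script that lookup than click through the website, here is a minimal sketch. It assumes GEO's accession viewer (acc.cgi) still accepts the acc, targ, form and view query parameters as described in the GEO programmatic access notes; the function names gpl_soft_url and fetch_gpl_soft are made up for this example. Full platform records can be large, so the download may take a while.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# GEO accession viewer endpoint (assumed to still be current).
GEO_ACC_CGI = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def gpl_soft_url(gpl_id):
    """Build the URL for the full SOFT-format text record of a platform."""
    params = {
        "acc": gpl_id,    # e.g. "GPL570"
        "targ": "self",   # the platform record itself, not linked samples/series
        "form": "text",   # SOFT-style plain text instead of HTML
        "view": "full",   # header plus the complete platform data table
    }
    return GEO_ACC_CGI + "?" + urlencode(params)


def fetch_gpl_soft(gpl_id, timeout=60):
    """Download the SOFT text for a platform (requires network access)."""
    with urlopen(gpl_soft_url(gpl_id), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_gpl_soft("GPL570") should return the platform record as one string, annotation table included, as long as the endpoint behaves as assumed.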

There are sometimes "annot" tables, and always the "soft" files. Both contain an annotation table, which varies widely between platforms - very few have actual probe sequences.
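For the platforms that do carry sequences, a rough sketch of pulling a "probe id - sequence" table out of SOFT text could look like the following. It assumes the table sits between the !platform_table_begin and !platform_table_end marker lines, is tab-delimited with a header row, and uses the column names ID and SEQUENCE; many platforms name or omit these columns differently, so treat both names as parameters to adjust. The function name probe_sequences_from_soft is hypothetical.

```python
import csv


def probe_sequences_from_soft(soft_text, id_column="ID", seq_column="SEQUENCE"):
    """Return {probe_id: sequence} from a SOFT platform record.

    Returns an empty dict when the record has no platform table or the
    table lacks the requested columns (the common case noted above).
    """
    table_lines = []
    inside_table = False
    for line in soft_text.splitlines():
        marker = line.strip().lower()
        if marker == "!platform_table_begin":
            inside_table = True
        elif marker == "!platform_table_end":
            break
        elif inside_table:
            table_lines.append(line)

    if not table_lines:
        return {}

    # First table line is the tab-delimited header row.
    reader = csv.DictReader(table_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    columns = reader.fieldnames or []
    if id_column not in columns or seq_column not in columns:
        return {}

    # Skip probes whose sequence field is empty or missing (e.g. controls).
    return {row[id_column]: row[seq_column] for row in reader if row.get(seq_column)}
```

An empty result then tells you the platform is one of the many without usable sequences in GEO, and you would have to fall back to the manufacturer's files.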

Oh, you need probe sequences. I am pretty sure that they are available via biomaRt, the CDF Bioconductor packages, and/or from the manufacturer. The manufacturer definitely has probe sequence files, e.g., Affymetrix U133: http://www.affymetrix.com/support/technical/byproduct.affx?product=hgu133

Thank you.

I've looked at biomaRt (and used it quite a few times) before - they seem to have only the most popular microarray platforms (about 30 different ones for human). If you look at GPLs in GEO, there are over 1000 each for human and mouse. Plus, there doesn't seem to be an easy way to match a GPL to biomaRt or to individual manufacturers' annotation packages, since they all seem to use slightly different names (I might be wrong, I'm still trying to figure it out).

I was amazed to see that most GPLs that GEO contains don't have sequences at all (which is the only thing you need to annotate them properly). Oh well. It wouldn't be the first thing that's messed up in bioinformatics :)

Yes, only the most common ones (it seems) are included at Ensembl and accessible via biomaRt. The manufacturers' websites should have the data though, no?