Question

How To Compare The Expression Of Two Genes In Large Microarray Datasets Such As ArrayExpress And GEO

14.3 years ago

Dror ▴ 280

Is there an easy, programmatic way to extract large datasets from online repositories of microarray data, such as GEO and ArrayExpress, and assess the co-expression of two genes across large-scale expression experiments? In more detail: I suspect that two genes have a similar expression pattern in mammals, so I want to scan all the microarrays in which these two genes appear and compare their expression patterns over a variety of experiments.

I would prefer doing it with Python/Biopython, but Perl would be OK too.

microarray geo python • 5.0k views

updated 14.3 years ago by Neilfws 49k • written 14.3 years ago by Dror ▴ 280

Ram · Answer 1 · 2011-03-11

I've spent some time on programmatic mining of GEO and ArrayExpress. I wish the answer were "yes there is", but it is not.

First, both databases have APIs. The API for the ArrayExpress gene atlas is described here. It is rather limited in terms of queries and somewhat buggy - in fact, it's an internal API exposed to the outside world and is not really ready for general use.

GEO is searchable using EUtils. Programmatic access is described here. I have compiled lists of the terms that you can use to search the Entrez databases at this link: take a look at the gds, geoprofiles and geo text files. All the major programming languages have EUtils libraries: here are links for Bioperl, Biopython and BioRuby. I know the latter best; a simple query might look like this:

#!/usr/bin/ruby
require "rubygems"
require "bio"

# query GEO for GSE
Bio::NCBI.default_email = "me@me.com"
ncbi   = Bio::NCBI::REST.new
search = ncbi.esearch("Homo+sapiens[ORGN] AND GSE[ETYP] AND cel[suppFile]", {"db" => "gds", "retmax" => 200})

That will find GEO series for human studies with supplementary CEL file data.
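Since the question asked for Python, here is a rough equivalent of the same search. It is only a sketch: it uses the standard library to build the ESearch request URL against the public E-utilities endpoint and does not send it. The helper name `build_gds_search_url` is made up for this example, and the email address is the same placeholder as in the Ruby snippet.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint for ESearch.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gds_search_url(term, email, retmax=200):
    """Build an ESearch URL for the GEO DataSets ("gds") database.

    Hypothetical helper for illustration; it only assembles the URL
    and performs no network request.
    """
    params = {"db": "gds", "term": term, "retmax": retmax, "email": email}
    return ESEARCH_URL + "?" + urlencode(params)


# Same query as the Ruby snippet: human series with supplementary CEL files.
term = "Homo sapiens[ORGN] AND GSE[ETYP] AND cel[suppFile]"
url = build_gds_search_url(term, email="me@me.com")
print(url)
```

If Biopython is installed, setting `Entrez.email` and calling `Entrez.esearch(db="gds", term=term, retmax=200)` from `Bio.Entrez`, then parsing the handle with `Entrez.read`, should give you the matching IDs without building the URL yourself.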

You will encounter numerous issues with GEO: particularly (1) poorly-annotated samples and errors due to e.g. typos, because standards are not enforced and (2) expression values which may or may not be normalised (and if they are, in a variety of ways). So brace yourself for lots of manual curation.

In fact if you're interested in only a few genes, you may decide that programmatic access is more trouble than it is worth and just explore via the web interfaces. At the NCBI, searching GEO Profiles can be useful for a gene-centric view. The ArrayExpress Gene Atlas interface starts here.
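If you do go the programmatic route, one way to cope with the inconsistent normalisation is to never pool values across series. Instead, compute a rank-based (Spearman) correlation between the two genes within each series and then look at how those per-series correlations are distributed. This is my own suggestion rather than anything GEO or ArrayExpress provides. The sketch below assumes you have already extracted one expression vector per gene per series; the series names and numbers are invented toy data.

```python
from math import sqrt
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation, or None if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented toy data: one vector per gene per series, samples in the same order.
datasets = {
    "series_A": {"gene1": [1.0, 2.0, 3.0, 4.0], "gene2": [10.0, 20.0, 35.0, 90.0]},
    "series_B": {"gene1": [5.0, 3.0, 8.0, 1.0], "gene2": [0.1, 0.5, 0.05, 0.9]},
}

# One correlation per series; values are never mixed across series.
per_series = {name: spearman(d["gene1"], d["gene2"]) for name, d in datasets.items()}
print(per_series)
```

Because ranks are unchanged by any monotone transformation, it does not matter whether a given series was log-transformed or left raw, as long as all samples within that series were treated the same way. It does nothing for the annotation problems, though, so the manual curation is still needed.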

score 4 · Answer 2 · 2011-03-11

I think the people behind Madcow have already computed this kind of data:

Madcow is a web tool for querying a coexpression database, with experiment filtering and several levels of significance. Results can be filtered, compared and annotated by identification of statistically over-represented Gene Ontology terms. Moreover, the user may visualize a coexpression network from the results by using the Cytoscape tool.