Question: How To Tell The Number Of Genes In A GEO Raw Matrix Txt File Without Using A Converter Tool?
9.4 years ago by
Cassie (20) wrote:

Hi, I am fairly new to the bioinformatics field, so apologies for the novice questions. Please correct me if I am wrong.

I am learning how to analyze raw gene expression datasets from GEO. I usually download the raw matrix txt files directly from the website.

As I understand it, if a dataset uses an Affymetrix platform, then each probe ID in a series matrix file should represent a gene. However, what if the submitter of a matrix file used another platform? How can I know whether each probe represents a gene or not?

Of course, we can convert each probe ID to a gene and count how many genes there are after converting. Is there an easier way to do so, such as checking some kind of document to see how many genes each probe represents?

Or am I searching the wrong database?

Thank you very much,

geo gene probeset • 3.1k views
modified 4.5 years ago by Biostar (20) • written 9.4 years ago by Cassie (20)
9.4 years ago by
Sydney, Australia
Neilfws (49k) wrote:

Hi Cassie,

These are good questions. Using the raw matrix files is a good choice if you're doing your own data processing and statistical analysis. Raw CEL files are even better if available, since they can be read and processed using packages in Bioconductor.

The basic answers to your queries are:

  • No, there is not always a one-to-one mapping between probe ID and gene for Affymetrix; it depends on which platform is used
  • It can be difficult to know what gene maps to which probe using GEO data alone

For example: older Affymetrix arrays, such as the U133A, contain probes which (mostly) map to the 3' end of a gene. However, other arrays, such as the human exon arrays, use multiple probes for each gene which (mostly) map to known exons.

Regarding GEO: the problem is that the GEO metadata regarding array platform is rather arbitrary (meaning that submitters can enter any details they like). So you get good examples, such as GPL570, with a lot of information about how probe IDs map to known genes. But then there are other examples with very little information.

There are lots of computational ways to map probes to genes and count the results. However, if you are interested primarily in Affymetrix, it's probably best to use the documentation at their website. Here for example is the entry page for 3' arrays. There are a lot of files describing array design at this website; you may have to create a (free) account and log in to access some of them. They are a little bewildering at first, but you will get the hang of it by reading the file descriptions and examining the file contents.
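As a rough illustration of one such computational approach, here is a minimal Python sketch that counts distinct genes behind the probes in a series matrix file. It is only a sketch under stated assumptions: the platform annotation is assumed to be a tab-delimited table with columns named "ID" and "Gene Symbol", with multiple symbols separated by "///" (as in well-annotated Affymetrix platform records); other platforms may use different column names or none at all. The function names and the toy probe and gene names are made up for the example.

```python
import csv


def read_matrix_probe_ids(lines):
    """Collect probe IDs from the data table of a GEO series matrix file."""
    probe_ids = []
    in_table = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
            continue
        if line.startswith("!series_matrix_table_end"):
            break
        if in_table and line:
            first = line.split("\t", 1)[0].strip('"')
            if first != "ID_REF":  # skip the table header row
                probe_ids.append(first)
    return probe_ids


def read_probe_to_genes(lines, id_col="ID", gene_col="Gene Symbol", sep="///"):
    """Map probe ID -> set of gene symbols from a platform annotation table.

    Assumption: column names and separator follow the defaults above.
    Lines starting with '#', '!' or '^' are treated as metadata and skipped.
    """
    rows = [l for l in lines if l.strip() and not l.startswith(("#", "!", "^"))]
    mapping = {}
    for row in csv.DictReader(rows, delimiter="\t"):
        raw = row.get(gene_col) or ""
        mapping[row[id_col]] = {s.strip() for s in raw.split(sep) if s.strip()}
    return mapping


def summarize(probe_ids, probe_to_genes):
    """Count probes, distinct genes, and probes with zero or several genes."""
    genes = set()
    unannotated = 0
    multi_gene = 0
    for pid in probe_ids:
        symbols = probe_to_genes.get(pid, set())
        if not symbols:
            unannotated += 1
        elif len(symbols) > 1:
            multi_gene += 1
        genes |= symbols
    return {
        "probes": len(probe_ids),
        "genes": len(genes),
        "unannotated_probes": unannotated,
        "multi_gene_probes": multi_gene,
    }


# Toy data (hypothetical probe and gene names, not from a real platform).
toy_matrix = [
    '!Series_title\t"toy example"',
    "!series_matrix_table_begin",
    '"ID_REF"\t"GSM1"\t"GSM2"',
    '"probe_1"\t1.0\t2.0',
    '"probe_2"\t3.0\t4.0',
    '"probe_3"\t5.0\t6.0',
    '"probe_4"\t7.0\t8.0',
    "!series_matrix_table_end",
]
toy_platform = [
    "#ID = probe identifier",
    "ID\tGene Symbol",
    "probe_1\tGENE_A",
    "probe_2\tGENE_A",
    "probe_3\tGENE_B /// GENE_C",
    "probe_4\t",
]

probe_ids = read_matrix_probe_ids(toy_matrix)
mapping = read_probe_to_genes(toy_platform)
summary = summarize(probe_ids, mapping)
print(summary)
```

With real files, the same functions can be given open file handles instead of the toy lists. The separate counts for unannotated and multi-gene probes are there because, as noted above, the probe-to-gene mapping is often not one-to-one.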

If you have more specific questions, such as how to map probe IDs to genes and count results for a specific platform, please ask them (either edit this question or create a new one).


Just a note that the GEOquery Bioconductor package can be useful for dealing with GEO text files and supplemental files (.CEL, etc.).

written 9.4 years ago by Sean Davis (26k)

Totally agree, Sean, though the package has been a little flaky of late.

written 9.4 years ago by Neilfws (49k)

Thanks. I will check those tools.

written 9.4 years ago by Cassie (20)