Question

How To Tell The Number Of Genes In A GEO Raw Matrix Txt File Without Using A Converter Tool?

12.6 years ago

Cassie ▴ 20

Hi, I am fairly new to the bioinformatics field, so sorry for the novice questions. Please correct me if I am wrong.

I am learning how to analyze raw gene expression datasets from GEO. I usually download the raw matrix txt files directly from the website.

As I understand it, if a dataset uses an Affymetrix platform, then each probe ID in a series matrix file should represent one gene. However, what if the submitter of a matrix file used another platform? How can I know whether each probe represents a gene or not?

Of course, we could convert each probe ID to its gene and count how many genes remain after the conversion. Is there an easier way, such as a document that lists how many genes each probe represents?

Or am I searching the wrong database?

Thank you very much,

geo gene probeset • 3.9k views

updated 7.8 years ago by Biostar 20 • written 12.6 years ago by Cassie ▴ 20

score 1 · Answer 1 · 2011-09-07

Hi Cassie,

These are good questions. Using the raw matrix files is a good choice if you're doing your own data processing and statistical analysis. Raw CEL files are even better if available, since they can be read and processed using packages in Bioconductor.

The basic answers to your queries are:

No, there is not always a one-to-one mapping between probe ID and gene for Affymetrix; it depends on which platform is used.
It can be difficult to know which probe maps to which gene using GEO data alone.

For example, older Affymetrix arrays, such as the U133A, contain probes which (mostly) map to the 3' end of a gene. However, other arrays, such as the human exon arrays, use multiple probes for each gene, which (mostly) map to known exons.

Regarding GEO: the problem is that the GEO metadata regarding array platform is rather arbitrary (meaning that submitters can enter any details they like). So you get good examples, such as GPL570, with a lot of information about how probe IDs map to known genes. But then there are other examples with very little information.
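As a rough illustration of what a well-annotated platform such as GPL570 allows: the platform's annotation table can be downloaded from its GEO page as tab-delimited text, and the genes it covers can be counted directly. The Python sketch below assumes the columns are named `ID` and `Gene Symbol` and that several genes on one probe set are separated by `///`, as in the GPL570 table; other platforms may use different column names or have no gene column at all. The function name `genes_per_platform` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

def genes_per_platform(handle, id_col="ID", gene_col="Gene Symbol", sep="///"):
    """Return {gene symbol: set of probe IDs} from a GEO platform table."""
    # Skip blank lines plus '#', '!' and '^' description/metadata lines,
    # so that the first remaining line is the column header.
    rows = (line for line in handle if line.strip() and line[0] not in "#!^")
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    probes_by_gene = defaultdict(set)
    for row in reader:
        symbols = row.get(gene_col) or ""      # empty cell = unannotated probe
        for symbol in symbols.split(sep):
            symbol = symbol.strip()
            if symbol:
                probes_by_gene[symbol].add(row[id_col])
    return dict(probes_by_gene)

# Tiny made-up excerpt laid out like a platform table (not real GPL570 rows).
sample = io.StringIO(
    "#ID = probe set identifier\n"
    "ID\tGene Symbol\n"
    "p1\tGENE_A\n"
    "p2\tGENE_A\n"
    "p3\tGENE_B /// GENE_C\n"
    "p4\t\n"
)
mapping = genes_per_platform(sample)
print(len(mapping), "distinct gene symbols")
for gene, probes in sorted(mapping.items()):
    print(gene, len(probes), "probe(s)")
```

Counting the keys gives the number of distinct genes on the platform, and the size of each set shows how many probes point at the same gene, which is exactly where the one-probe-one-gene assumption breaks down.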

There are lots of computational ways to map probes to genes and count the results. However, if you are interested primarily in Affymetrix, it's probably best to use the documentation at their website. Here for example is the entry page for 3' arrays. There are a lot of files describing array design at this website; you may have to create a (free) account and log in to access some of them. They are a little bewildering at first, but you will get the hang of it by reading the file descriptions and examining the file contents.
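One of those computational routes can be sketched in Python: read the probe IDs from a series matrix file, then count the distinct genes they map to. The sketch relies on the `!series_matrix_table_begin` and `!series_matrix_table_end` marker lines that GEO series matrix files use. The `probe_to_genes` dictionary is a hypothetical mapping that you would have to build yourself from a platform annotation file, and the function names and sample data are invented for the example.

```python
import io

def read_probe_ids(handle):
    """Return the probe IDs (first column) of a GEO series matrix table."""
    ids = []
    in_table = False
    header_seen = False
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            break
        elif in_table:
            if not header_seen:
                header_seen = True   # first table row is the ID_REF/GSM header
            elif line:
                ids.append(line.split("\t")[0].strip('"'))
    return ids

def count_genes(probe_ids, probe_to_genes):
    """Return (number of distinct genes, number of probes with no gene)."""
    genes = set()
    unmapped = 0
    for probe_id in probe_ids:
        hits = probe_to_genes.get(probe_id)
        if hits:
            genes.update(hits)
        else:
            unmapped += 1
    return len(genes), unmapped

# Made-up series matrix excerpt and made-up mapping, for illustration only.
sample = io.StringIO(
    '!Series_title\t"Made-up example"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM0001"\t"GSM0002"\n'
    '"p1"\t7.1\t6.9\n'
    '"p2"\t5.0\t5.2\n'
    '"p3"\t8.3\t8.1\n'
    "!series_matrix_table_end\n"
)
probe_to_genes = {"p1": {"GENE_A"}, "p2": {"GENE_A"}}

ids = read_probe_ids(sample)
n_genes, n_unmapped = count_genes(ids, probe_to_genes)
print(f"{len(ids)} probes, {n_genes} genes, {n_unmapped} unmapped probes")
```

For a real download, the same function should work on a file opened with `gzip.open(path, "rt")`, since series matrix files are usually distributed gzipped. Reporting the unmapped probes separately matters because poorly annotated platforms can leave many probes without any gene.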

If you have more specific questions, such as how to map probe IDs to genes and count results for a specific platform, please ask them (either edit this question or create a new one).