These are good questions. Using the raw matrix files is a good choice if you're doing your own data processing and statistical analysis. Raw CEL files are even better if available, since they can be read and processed using packages in Bioconductor.
The basic answers to your queries are:
- No, there is not always a one-to-one mapping between probe ID and gene for Affymetrix; it depends on which platform is used
- It can be difficult to work out which probe maps to which gene using GEO data alone
For example, older Affymetrix arrays such as the U133A contain probes which (mostly) map to the 3' end of a gene. Other arrays, such as the human exon arrays, use multiple probes for each gene which (mostly) map to known exons.
Regarding GEO: the problem is that the GEO metadata regarding array platform is rather arbitrary (meaning that submitters can enter any details they like). So you get good examples, such as GPL570, with a lot of information about how probe IDs map to known genes. But then there are other examples with very little information.
There are lots of computational ways to map probes to genes and count the results. However, if you are interested primarily in Affymetrix, it's probably best to use the documentation at their website. Here for example is the entry page for 3' arrays. There are a lot of files describing array design at this website; you may have to create a (free) account and log in to access some of them. The files are a little bewildering at first, but you will get the hang of them by reading the file descriptions and examining the file contents.
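As a rough illustration of one computational route, here is a minimal Python sketch that reads a tab-delimited platform annotation table and counts how probes and genes relate to each other. The column names (`ID`, `Gene Symbol`), the ` /// ` separator for probes annotated with several genes, and the sample rows are all assumptions made for the example; check them against the actual file for your platform.

```python
import csv
import io
from collections import defaultdict

# Made-up excerpt in the style of a tab-delimited platform annotation table.
# Probe IDs, gene names, column names and the "///" separator are assumptions.
SAMPLE_TABLE = (
    "ID\tGene Symbol\n"
    "probe_1\tGENE_A\n"
    "probe_2\tGENE_A\n"
    "probe_3\tGENE_B /// GENE_C\n"
    "probe_4\t\n"
)


def map_probes_to_genes(handle, id_col="ID", gene_col="Gene Symbol", sep="///"):
    """Return {probe ID: [gene symbols]} from a tab-delimited table."""
    probe_to_genes = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        raw = row.get(gene_col) or ""
        # Split multi-gene annotations and drop empty entries.
        genes = [g.strip() for g in raw.split(sep) if g.strip()]
        probe_to_genes[row[id_col]] = genes
    return probe_to_genes


def invert(probe_to_genes):
    """Return {gene symbol: [probe IDs]}."""
    gene_to_probes = defaultdict(list)
    for probe, genes in probe_to_genes.items():
        for gene in genes:
            gene_to_probes[gene].append(probe)
    return dict(gene_to_probes)


probe_to_genes = map_probes_to_genes(io.StringIO(SAMPLE_TABLE))
gene_to_probes = invert(probe_to_genes)

# Probes with no gene annotation at all.
unannotated = [p for p, genes in probe_to_genes.items() if not genes]
# Probes annotated with more than one gene.
multi_gene = [p for p, genes in probe_to_genes.items() if len(genes) > 1]
# Genes targeted by more than one probe.
multi_probe = [g for g, probes in gene_to_probes.items() if len(probes) > 1]

print("probes:", len(probe_to_genes))
print("genes:", len(gene_to_probes))
print("unannotated probes:", unannotated)
print("probes mapping to several genes:", multi_gene)
print("genes with several probes:", multi_probe)
```

With a real file you would pass an open file handle instead of the `io.StringIO` sample, and you may first need to skip any comment or header lines that precede the column names.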
If you have more specific questions, such as how to map probe IDs to genes and count results for a specific platform, please ask them (either edit this question or create a new one).