For instance, the probeset 207739_s_at matches several genes (> 10) of the GAGE family. Likewise, the probeset 217365_at matches several member genes (> 5) of the PRAME family.
This is definitely the case: a single probeset can contain a majority of probes that map to more than one location in the genome.
So I used SCAMPA: http://web.bioinformatics.ic.ac.uk/scampa/section.html?id=5
To do this, the tool has pre-defined thresholds for each of its levels, but you should be able to hack the source to define them yourself.
Of course, these corrections are sensitive to the genome build you are using.
It depends to some degree on the array platform but yes, to reiterate what has already been said, probes can match more than one location. This can be due to duplication within the genome: for example, the 5'-end of the X chromosome is very similar to the Y chromosome, so probesets such as 218951_s_at (from the HG-U133A platform) match both.
There are tools to deal with this, but one approach is to download the relevant data from e.g. UCSC or Affymetrix and process it with a custom script to remove probesets with > 1 location.
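As a rough illustration of such a custom script, here is a minimal Python sketch. It assumes the downloaded data has already been parsed into (probeset ID, chromosome, start) rows; that layout, the function name `single_location_probesets`, the placeholder ID `hypothetical_probeset_at`, and the coordinates are all made up for the example, so adapt them to whatever the UCSC or Affymetrix file actually contains.

```python
from collections import defaultdict


def single_location_probesets(alignments):
    """Return the probeset IDs that align to exactly one distinct location.

    `alignments` is an iterable of (probeset_id, chromosome, start) tuples.
    This row layout is an assumption, not a real UCSC/Affymetrix format.
    """
    locations = defaultdict(set)
    for probeset_id, chrom, start in alignments:
        # A set ignores exact duplicate rows for the same location.
        locations[probeset_id].add((chrom, start))
    return sorted(p for p, locs in locations.items() if len(locs) == 1)


# Made-up rows for illustration only; the coordinates are NOT real.
rows = [
    ("218951_s_at", "chrX", 1000),
    ("218951_s_at", "chrY", 1000),
    ("hypothetical_probeset_at", "chr1", 5000),
]

kept = single_location_probesets(rows)
print(kept)  # ['hypothetical_probeset_at']
```

Note that this works at the probeset level; if your data lists individual probe alignments, you would need to decide how many multi-mapping probes are enough to discard the whole probeset.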
Yes, and this can be annoying when doing downstream annotation like KEGG pathways, etc. For example, if one probe says MapK is going up and another says MapK is going down, how should I annotate this on a "gene-centered" graph like GO or KEGG?
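To at least see where this happens before annotating, something like the following Python sketch can flag genes whose probesets disagree in direction. The input layout (probeset ID, gene label, log fold change), the function name `conflicting_genes`, the probeset names, and the values are all hypothetical; it only detects the conflict and does not resolve it.

```python
from collections import defaultdict


def conflicting_genes(probe_changes):
    """Return genes reported as both up and down by different probesets.

    `probe_changes` is an iterable of (probeset_id, gene, log_fold_change)
    tuples; this layout is an assumption made for the example.
    """
    directions = defaultdict(set)
    for _probeset_id, gene, lfc in probe_changes:
        if lfc > 0:
            directions[gene].add("up")
        elif lfc < 0:
            directions[gene].add("down")
        # lfc == 0 is treated as "no change" and ignored.
    return sorted(g for g, d in directions.items() if d == {"up", "down"})


# Made-up probeset names and values, for illustration only.
changes = [
    ("probeset_A", "MapK", 1.2),
    ("probeset_B", "MapK", -0.8),
    ("probeset_C", "OtherGene", 0.5),
]

flagged = conflicting_genes(changes)
print(flagged)  # ['MapK']
```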
My solution has been to use the "BrainArray" custom CDFs. These are created, and updated weekly, to reconstruct the Affymetrix probesets so that each probeset matches a SINGLE ID. They have a build for UniGene (every probeset matches a single UniGene ID), Entrez Gene, Entrez Protein, and dozens of others.
I've found that this makes my downstream annotation much easier when I'm dealing with gene- and protein-level annotations. The only problem is that you need to have the RAW CEL data to use these CDFs.
Hope that helps, Will