I'm learning to analyse data from Human Exon arrays and found something curious, which I don't know how to handle. I searched BioStar and couldn't find anything closely related to this issue.
I've done all the processing up to generating a list of "differentially expressed probe sets" (DEPS) with RMA/limma without any problems. I ran RMA at the probeset level and used biomaRt to get the gene annotation for the DEPS. (I tried the getNetAffx function as well, to no avail; I still didn't know which gene symbol to choose for some probesets.)
When I looked at the annotated results I noticed that more than 600 probesets were annotated to more than one gene symbol (or Entrez ID, or Ensembl ID; it didn't matter which). I know that the converse is absolutely fine (two or more probesets annotating to the same gene), but I wasn't expecting it to happen the other way around.
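To make the check concrete, here is a minimal sketch of how multi-annotated probesets could be flagged, assuming the annotation has been exported as (probeset ID, gene symbol) pairs. It is written in Python purely for illustration and is not the biomaRt/NetAffx workflow itself; the function names, probeset IDs and gene symbols are all made up.

```python
from collections import defaultdict


def group_symbols(rows):
    """Collect the distinct gene symbols reported for each probeset ID."""
    symbols = defaultdict(set)
    for probeset_id, symbol in rows:
        if symbol:  # ignore rows with an empty annotation
            symbols[probeset_id].add(symbol)
    return symbols


def ambiguous_probesets(rows):
    """Return {probeset_id: sorted symbols} for probesets with >1 distinct symbol."""
    return {
        pid: sorted(syms)
        for pid, syms in group_symbols(rows).items()
        if len(syms) > 1
    }


# Hypothetical annotation rows (probeset ID, gene symbol); not real data.
example_rows = [
    ("PS_0001", "GENE_A"),
    ("PS_0001", "GENE_B"),  # same probeset, second symbol -> ambiguous
    ("PS_0002", "GENE_C"),
    ("PS_0003", "GENE_C"),  # two probesets, one gene -> the expected case
    ("PS_0004", ""),        # unannotated probeset
]

ambiguous = ambiguous_probesets(example_rows)
print(ambiguous)  # {'PS_0001': ['GENE_A', 'GENE_B']}
```

Keeping every symbol per probeset, rather than collapsing to one straight away, at least makes the ambiguous cases visible before deciding how to treat them.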
I then batch-searched for annotation information directly on the NetAffx website and, still, got more than 1 gene symbol for some of the probesets.
My question is: how should I choose the appropriate gene symbol for a given probeset when there are multiple hits? I'm leaning towards picking the first gene symbol returned from the NetAffx query, but this seems too crude...
Perhaps a related question would be: should I forget about analyzing data at the probeset level and simply do it at the transcript cluster (gene) level instead?