I am using the oligo package to preprocess CEL files from the Human Exon 1.0 ST array. I have summarised the expression data to the transcript cluster level (rma(celfiles, target="core")) and end up with a total of 22011 transcript clusters. I would like to perform gene-level rather than transcript-level analysis, so I need to map the transcript clusters to genes.
I obtained the corresponding annotation with the following command: featureData(exonCore) <- getNetAffx(exonCore, "transcript"). However, when I look at the annotation returned by pData(featureData(exonCore))[,c("probesetid","geneassignment")], a few thousand transcript clusters have no gene assignment at all. That may be the smaller issue. More importantly, many transcript clusters are mapped to several gene symbols, because the geneassignment column contains multiple entries for them.
When I remove the transcript clusters that map to multiple gene symbols, I end up with roughly 12,000 to 14,000 transcript clusters that map uniquely to a gene. This number seems too low to me; for example, the TCGA exon expression data contains about 18,000 genes.
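For reference, here is a minimal, self-contained sketch of the filtering I describe above. It uses a small made-up table in place of pData(featureData(exonCore)), and it assumes that entries in geneassignment are separated by " /// ", that fields within an entry are separated by " // ", and that the gene symbol is the second field. The accessions, gene symbols and the helper name extract_symbols are all hypothetical. It counts distinct symbols per transcript cluster, so a cluster that lists several transcripts of the same gene still counts as uniquely mapped.

```r
# Toy stand-in for pData(featureData(exonCore))[, c("probesetid", "geneassignment")].
# All IDs, accessions and symbols below are invented for illustration.
annot <- data.frame(
  probesetid = c("1000001", "1000002", "1000003", "1000004"),
  geneassignment = c(
    "ACC1 // GENEA // gene A // 1p36 // 101 /// ACC2 // GENEA // gene A // 1p36 // 101",
    "ACC3 // GENEB // gene B // 2q13 // 102 /// ACC4 // GENEC // gene C // 2q13 // 103",
    NA,
    "ACC5 // GENED // gene D // 3p21 // 104"
  ),
  stringsAsFactors = FALSE
)

# Return the distinct gene symbols found in one geneassignment string.
# Assumed layout: entries split by " /// ", fields split by " // ",
# gene symbol in field 2; "---" is treated as a missing value.
extract_symbols <- function(x) {
  if (is.na(x) || x == "---") return(character(0))
  entries <- strsplit(x, " /// ", fixed = TRUE)[[1]]
  fields <- strsplit(entries, " // ", fixed = TRUE)
  symbols <- vapply(
    fields,
    function(f) if (length(f) >= 2) trimws(f[2]) else NA_character_,
    character(1)
  )
  unique(symbols[!is.na(symbols) & symbols != "---"])
}

symbols <- lapply(annot$geneassignment, extract_symbols)
n_symbols <- lengths(symbols)

# Tally of clusters with 0, 1, 2, ... distinct gene symbols.
print(table(n_symbols))

# Keep only the transcript clusters that map to exactly one gene symbol.
unique_map <- data.frame(
  probesetid = annot$probesetid[n_symbols == 1],
  symbol = unlist(symbols[n_symbols == 1]),
  stringsAsFactors = FALSE
)
print(unique_map)
```

If the count of uniquely mapped clusters is much higher when computed this way than when filtering on the raw string, then a good part of the "multiple mapping" comes from several transcripts of the same gene rather than from different genes.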
Am I using the annotation correctly, or have I already done something wrong? Would it generally be a better strategy to summarise at the probe set level and then represent each gene by its constituent probe sets in some way?