7.9 years ago by
Washington University, St Louis, USA
This is actually a big question. For Affymetrix GeneChip data, you often have both raw (CEL) files and pre-processed data available through GEO, ArrayExpress, etc. The CEL file contains intensity values calculated from the actual scanned array images (DAT files). The CEL file, together with a CDF file (which describes the layout of an Affymetrix GeneChip array), gives you an intensity value for each probe and tells you which probe set each probe belongs to. However, individual probes are rarely used in downstream analysis. Instead, they are usually summarized at the probe set level.

When Affymetrix designs a GeneChip, they target a certain number of specific gene loci and design a set of oligo sequences from an exemplar sequence for each target. Typically there are 11-20 unique oligomeric probes, each 25 bases long, for each targeted gene or transcript. For each oligo probe that matches the target sequence perfectly (a PM probe), there is also a corresponding probe with a single mismatch (an MM probe). This design explains how you can have 540909 probes which actually represent 22125 probe sets.

However, there are many different ways to get from probe intensities to probe set summary values. Affymetrix provides algorithms (e.g., MAS5 and PLIER) which combine the values from all PM and MM probes into a single estimate of transcript level for each target. Other popular algorithms ignore MM probes (e.g., RMA) or additionally try to account for hybridization effects related to GC content (e.g., GCRMA). To further complicate matters, several groups have redefined the original Affy probe sets, using a more current reference genome and understanding of the transcriptome, to produce custom CDF files with different numbers of total probe sets and probes per probe set.
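To make the probe-to-probe-set step more concrete, here is a minimal Python sketch of a median-polish summary, which is the kind of summarization RMA uses once the PM intensities have been background-corrected and normalized. It is not a full RMA implementation: background correction and quantile normalization are skipped, and the toy intensities, the `probe_sets` mapping (standing in for a CDF), and the function name `median_polish_summary` are all made up for illustration. For real data you would normally use an established implementation, such as the one in Bioconductor.

```python
import numpy as np


def median_polish_summary(pm, max_iter=10, tol=1e-6):
    """Summarize one probe set with Tukey's median polish.

    pm: PM intensities, shape (n_probes, n_arrays).
    Returns one log2 expression estimate per array.
    """
    resid = np.log2(np.asarray(pm, dtype=float))
    overall = 0.0
    row = np.zeros(resid.shape[0])  # probe effects
    col = np.zeros(resid.shape[1])  # array (sample) effects

    for _ in range(max_iter):
        # Sweep out the median of each probe (row).
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        row += row_med
        delta = np.median(col)
        col -= delta
        overall += delta

        # Sweep out the median of each array (column).
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        col += col_med
        delta = np.median(row)
        row -= delta
        overall += delta

        # Stop once a full sweep no longer changes anything.
        if max(np.abs(row_med).max(), np.abs(col_med).max()) < tol:
            break

    # Probe set expression on each array = overall level + array effect.
    return overall + col


# Toy data: 8 probes measured on 3 arrays (all values are invented).
# Each log2 intensity is a probe-specific affinity plus an array-specific level.
affinity_a = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
affinity_b = np.array([-0.25, 0.0, 0.25])
level_a = np.array([8.0, 9.0, 10.0])
level_b = np.array([5.0, 5.0, 6.0])
toy_pm = np.vstack([
    2.0 ** (affinity_a[:, None] + level_a[None, :]),
    2.0 ** (affinity_b[:, None] + level_b[None, :]),
])

# Hypothetical stand-in for a CDF: probe set name -> rows of its probes.
probe_sets = {"probeset_A": [0, 1, 2, 3, 4], "probeset_B": [5, 6, 7]}

summary = {
    name: median_polish_summary(toy_pm[rows])
    for name, rows in probe_sets.items()
}
for name, values in summary.items():
    print(name, np.round(values, 3))
```

The point of the sketch is that eight probe-level rows collapse to two probe-set-level rows, one value per array, and that a different summarization method (or a different CDF mapping) would give different numbers from the same CEL files.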
For the specific data set you linked to (E-TABM-157), the ArrayExpress citation looks wrong to me. I believe the original paper can be found here. In their methods you can see that they processed with RMA in R/Bioconductor. This is a very common approach.
Here are some links which might help you understand more: