Question: How To Identify The Method Used To Reduce The Number Of Probesets In A CEL File Obtained From ArrayExpress
7.9 years ago by
mtyler.jason wrote:

Hi all,

I was going through this gene expression data set on ArrayExpress. It has both the raw CEL data and the processed matrix data. I have a question. It uses the HG-U133A chip, which has around 22125 probe set IDs. If you look at the original CEL file, it has around 540909 probes. However, in the processed matrix file you have the 22125 probe sets and their corresponding intensities. I wanted to know how the 540909 probe intensities are filtered to get the corresponding 22125 ones.

I am confused about how the preprocessing is done. Suggestions?

gene-expression probeset • 2.7k views
7.9 years ago by
Obi Griffith (Washington University, St Louis, USA) wrote:

This is actually a big question. It is often the case for Affymetrix GeneChip data that both raw (CEL) files and pre-processed data are made available through GEO, ArrayExpress, etc. The CEL file contains intensity values calculated from the actual scanned array images (DAT files). The CEL file, together with a CDF file (which describes the layout of an Affymetrix GeneChip array), can be used to obtain an intensity value for each probe. However, individual probes are rarely used in downstream analysis. Instead, they are usually summarized together at the probe set level.

When Affymetrix designs a GeneChip, they target a certain number of specific gene loci and design a set of oligo sequences from an exemplar sequence for each target. Typically there are 11-20 unique oligomeric probes, each 25 bases in length, for each targeted gene or transcript. For each oligo probe which matches the target sequence perfectly (PM probe) there is also a corresponding probe with a single mismatch (MM probe). This design explains how you can have 540909 probes which actually represent 22125 probe sets.

However, there are many different ways to get from probe intensities to probe set summary values. Affymetrix provides algorithms (e.g., MAS5 and PLIER) which combine the values from all PM and MM probes into a single estimate of transcript level for each target. Other popular algorithms ignore MM probes (e.g., RMA) or try to account for hybridization effects related to GC content (e.g., GCRMA). To further complicate matters, several groups have redefined the original probe sets from Affymetrix, using a more current reference genome and understanding of the transcriptome, to produce custom CDF files with different numbers of total probe sets and probes per probe set.
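To make the probe-to-probe-set step more concrete, here is a minimal Python sketch of median polish, the kind of robust summarization RMA uses to turn several PM probe values into one value per array. It is only an illustration under assumptions: the function names and the toy matrix are made up, the input is assumed to be log2 PM intensities that have already been background-corrected and normalized, and it is not the Bioconductor implementation.

```python
from statistics import median


def median_polish(matrix, max_iter=10, tol=1e-6):
    """Tukey median polish on a matrix given as a list of rows.

    Rows are probes of one probe set, columns are arrays.
    Returns (overall, row_effect, col_effect, residuals).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    residuals = [list(row) for row in matrix]
    overall = 0.0
    row_effect = [0.0] * n_rows
    col_effect = [0.0] * n_cols

    for _ in range(max_iter):
        # Sweep out row (probe) medians.
        row_med = [median(row) for row in residuals]
        for i in range(n_rows):
            row_effect[i] += row_med[i]
            for j in range(n_cols):
                residuals[i][j] -= row_med[i]
        shift = median(col_effect)
        overall += shift
        col_effect = [c - shift for c in col_effect]

        # Sweep out column (array) medians.
        col_med = [median(residuals[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_effect[j] += col_med[j]
            for i in range(n_rows):
                residuals[i][j] -= col_med[j]
        shift = median(row_effect)
        overall += shift
        row_effect = [r - shift for r in row_effect]

        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break

    return overall, row_effect, col_effect, residuals


def summarize_probe_set(log2_pm):
    """One summary value per array: overall level plus the array effect."""
    overall, _, col_effect, _ = median_polish(log2_pm)
    return [overall + c for c in col_effect]


# Toy data (made up): 3 PM probes of one probe set measured on 2 arrays.
toy_log2_pm = [
    [8.0, 10.0],
    [9.0, 11.0],
    [7.0, 9.0],
]
print(summarize_probe_set(toy_log2_pm))
```

Because the toy matrix is exactly additive (probe offsets 0, +1, -1 on top of array levels 8 and 10), the summary recovers one value per array regardless of the individual probe offsets. Real data will leave non-zero residuals.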

For the specific data set you linked to (E-TABM-157), the ArrayExpress citation looks wrong to me. I believe the original paper can be found here. In their methods you can see that they processed with RMA in R/Bioconductor. This is a very common approach.
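RMA is usually described as three steps: background correction, quantile normalization across arrays, and probe set summarization. As a rough illustration of the middle step only, here is a small Python sketch of quantile normalization. The function name and the toy values are assumptions for illustration, ties are handled naively, and this is not the code the authors ran.

```python
def quantile_normalize(arrays):
    """Force every array to share the same intensity distribution.

    arrays: list of equal-length lists, one list of probe values per array.
    Returns a new list of lists in the original probe order.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Mean of the k-th smallest value across arrays, for each rank k.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[k] for s in sorted_arrays) / len(arrays)
        for k in range(n_probes)
    ]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda idx: a[idx])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Toy data (made up): 2 arrays, 3 probes each.
toy_arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(toy_arrays))
```

After this step each array contains the same set of values, only assigned to different probes according to each probe's rank within its own array.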

Here are some links which might help you understand more:


@Obi Thanks a lot

Reply written 7.9 years ago by mtyler.jason
7.9 years ago by
Sebastian Kurscheid (Australia, ACT, Canberra, ANU) wrote:

Take a look at the Affymetrix data sheets, e.g. here


Comprised of more than 22,000 probe sets and 500,000 distinct oligonucleotide features.

The CEL file contains the signal intensities for these "oligonucleotide features", one value per feature, computed during image processing from the pixels the scanner recorded when digitizing the original array. These feature intensities are then summarized to a probe set level intensity. The probe set level data (in my opinion) is the only data of interest for further analysis.
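As a rough picture of how many features collapse into few probe sets, here is a small Python sketch that groups per-feature intensities by probe set using a layout table. Everything in it is hypothetical: the dictionaries only stand in for what a CDF file (layout) and a CEL file (intensities) provide, and the positions, names, and values are made up.

```python
from collections import defaultdict

# Hypothetical stand-in for a CDF: maps a feature's (x, y) position to the
# probe set it belongs to and its type (PM or MM).
toy_layout = {
    (0, 0): ("probeset_A", "PM"),
    (0, 1): ("probeset_A", "MM"),
    (1, 0): ("probeset_A", "PM"),
    (1, 1): ("probeset_A", "MM"),
    (2, 0): ("probeset_B", "PM"),
    (2, 1): ("probeset_B", "MM"),
}

# Hypothetical stand-in for CEL content: one intensity per feature.
toy_intensities = {
    (0, 0): 1200.0, (0, 1): 300.0,
    (1, 0): 1500.0, (1, 1): 350.0,
    (2, 0): 200.0, (2, 1): 180.0,
}


def group_by_probe_set(layout, intensities):
    """Collect PM and MM intensities for each probe set."""
    grouped = defaultdict(lambda: {"PM": [], "MM": []})
    for position, (probe_set, probe_type) in layout.items():
        grouped[probe_set][probe_type].append(intensities[position])
    return dict(grouped)


grouped = group_by_probe_set(toy_layout, toy_intensities)
print(len(toy_intensities), "features ->", len(grouped), "probe sets")
```

Nothing is filtered out at this stage. The features are only grouped, and a summarization method then turns each group into a single value per array.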


@Genomicsio OK. However, I want to be sure that this is the case.

Reply written 7.9 years ago by mtyler.jason