Question

Control probe sets in Affymetrix ST microarrays

8.5 years ago

Nathaniel ▴ 120

I am analysing an Affymetrix Mouse Gene ST 2.0 array (http://www.affymetrix.com/catalog/131476/AFFY/Mouse+Gene+ST+Arrays).

According to the annotation file available at the link above, the probesets consist of the following types:

main: the real genes
control->affx: standard AFFX control
control->bgp: Antigenomic or genomic? background control
normgene->intron: from an intronic region of a normalization control gene
normgene->exon: from an exonic region of a normalization control gene
flmrna->unmapped

I have three questions concerning those:

Can someone explain to me the difference between the control probe set types and how they are used in the processing steps?
Do I have to include all of them in the normalization process?
I want to remove them before the differential expression analysis. To do that, I run getMainProbes() from the 'oligo' package. However, the function does not remove the normgene->exon probe sets. Is that expected?

affymetrix microarray • 5.7k views

score 8 · Answer 1 · 2017-09-20

The AFFX probes are more or less positive controls, whilst the BGP probes measure background. Together they allow you to estimate the signal-to-noise ratio and are thus key to the first part of RMA normalisation, i.e., background correction.

The analysis of Affymetrix arrays in R can be a bit cumbersome, and whether the control probes are removed or not depends on how the package that you're using was written. affy and oligo are two of the packages that I have used. In the past, I've noticed that oligo removes them whilst affy does not.

If you notice them still in your dataset after normalisation, then you can just remove them with something like:

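library(limma)  # provides lmFit(); featureNames() comes from Biobase

# Flag the control probe sets by name. A logical vector is used because
# negative indexing with an empty grep() result would drop every row.
IsControl <- grepl("AFFX", featureNames(NormalisedData))
table(IsControl)

# Fit the linear model on the remaining probe sets only
fit <- lmFit(NormalisedData[!IsControl, ], design = DesignMatrix)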

Regarding the normgene intron/exon control probes, I can only imagine that they are simply extra positive control probes that target the introns and exons of traditional housekeeping genes like GAPDH, but I cannot confirm that.
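
If matching on the "AFFX" name is not enough (for example, to drop the normgene and bgp probe sets as well), one option is to filter on the probe set type from the annotation file instead. The following is only a minimal sketch on toy data: the table `annot`, its columns `probeset_id` and `category`, and the matrix `expr` are hypothetical stand-ins, not the layout of the real annotation file or the output of any package.

```r
# Toy annotation table; the column names are hypothetical stand-ins for a
# real annotation file, and the probe set IDs are made up.
annot <- data.frame(
  probeset_id = c("ps1", "ps2", "ps3", "ps4", "ps5", "ps6"),
  category = c("main", "control->affx", "control->bgp",
               "normgene->intron", "normgene->exon", "main"),
  stringsAsFactors = FALSE
)

# Toy normalised expression matrix with one row per probe set
set.seed(42)
expr <- matrix(rnorm(6 * 4), nrow = 6,
               dimnames = list(annot$probeset_id, paste0("sample", 1:4)))

# Count the probe sets in each category before filtering
table(annot$category)

# Look up each row's category and keep only the rows annotated as "main"
keep <- annot$category[match(rownames(expr), annot$probeset_id)] == "main"
expr_main <- expr[keep, , drop = FALSE]
rownames(expr_main)
```

Tabulating the categories first makes it easy to see which types are present, and the same logical vector can be applied to an ExpressionSet before calling lmFit().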

score 2 · Answer 2 · 2018-02-28

I also spent a lot of time trying to understand what the different probe types and controls are. Part of the answer can be found in the "Quality Assessment of Exon and Gene Arrays" whitepaper from Affymetrix.

This is the current link:

https://tools.thermofisher.com/content/sfs/brochures/exon_gene_arrays_qa_whitepaper.pdf

And the relevant paragraphs about the normgene->intron and normgene->exon probes:

neg_control is the set of putative intron based probe sets from putative housekeeping genes. Specifically, a number of species specific probesets on 3’ IVT arrays were shown to have constitutive expression over a large number of samples. The genes for these probesets were identified and multiple four probe probesets were selected against the putative intronic regions. (See the respective exon array design Technote for more information.) Thus in any given sample some (or many) of these putative intronic regions may be transcribed and retained. Furthermore, some (or many) of the genes may not be constitutive within certain data sets. These caveats aside, this collection makes for a moderately large collection of probesets which in general have very low signal values. These probesets are used to estimate the false positive rate for the pos_vs_neg_auc metric.

pos_control is the set of putative exon based probe sets from putative housekeeping genes. Specifically, a number of species specific probesets on 3’ IVT arrays were shown to have constitutive expression over a large number of samples. The genes for these probesets were identified and multiples of four probe probesets were selected against the putative exonic regions. (See the respective exon array design Technote for more information.) Thus in any given sample some (or many) of these putative exonic regions may not be transcribed or may be spliced out. Furthermore, some (or many) of the genes may not be constitutive within certain data sets. These caveats aside, this collection makes for a moderately large collection of probesets with target present which in general have moderate to high signal values. These probesets are used to estimate the true positive rate for the pos_vs_neg_auc metric.

The pos_control and all_probeset categories are useful in getting a handle on the overall quality of the data from each chip. Metrics based on these categories will reflect the quality of the whole experiment (RNA, target prep, chip, hybridization, scanning, and griding) and the nature of the data being used in downstream statistical analysis. The polya_spike category are useful for identifying potential problems with the target prep phase of the experiment; the bac_spike category are useful for identifying potential problems with the hybridization and chip. The caveat with these two categories is the limited number of spikes. Thus they should be used to troubleshoot problems whereas the pos_control and all_probeset categories should be used to assess overall quality.

The AUC metric they use to check the differences between pos and neg:

pos_vs_neg_auc is the area under the curve (AUC) for a receiver operating characteristic (ROC) plot comparing signal values for the positive controls to the negative controls. (See Section IV below for more information on the positive and negative probeset categories). The ROC curve is generated by evaluating how well the probe set signals separate the positive controls from the negative controls with the assumption that the negative controls are a measure of false positives and the positive controls are a measure of true positives. An AUC of 1 reflects perfect separation whereas an AUC value of 0.5 would reflect no separation. Note that the AUC of the ROC curve is equivalent to a rank sum statistic used to test for differences in the center of two distributions. In the case of the exon and gene arrays the positive and negative controls are pseudo positives and negatives (see below). In practice the expected value for this metric is tissue type specific and may be sensitive to the quality of the RNA sample. Values between 0.80 and 0.90 are typical. For exon level analysis an additional ROC AUC metric is reported based on Detected Above BackGround (DABG) detection p-values, dabg_pos_vs_neg_auc.
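
The equivalence between the ROC AUC and a rank sum statistic mentioned in the whitepaper can be illustrated in a few lines of base R. This is a sketch under stated assumptions, not the vendor's implementation: the function name `pos_vs_neg_auc` is borrowed from the metric for readability, and the signal values are simulated rather than taken from a real array.

```r
# AUC of the ROC curve via the rank-sum (Mann-Whitney) identity.
# pos, neg: numeric vectors of signal values (hypothetical inputs).
pos_vs_neg_auc <- function(pos, neg) {
  n_pos <- length(pos)
  n_neg <- length(neg)
  r <- rank(c(pos, neg))  # mid-ranks, so a tied pair counts as 0.5
  u <- sum(r[seq_len(n_pos)]) - n_pos * (n_pos + 1) / 2
  u / (n_pos * n_neg)
}

set.seed(1)
# Simulated log2 signals (made-up values): exonic controls sit higher
# than intronic controls, but the two distributions overlap.
pos <- rnorm(200, mean = 8, sd = 1.5)
neg <- rnorm(200, mean = 5, sd = 1.5)

auc <- pos_vs_neg_auc(pos, neg)
auc

# Cross-check: the Wilcoxon statistic W divided by the number of
# (pos, neg) pairs gives the same value.
w <- unname(wilcox.test(pos, neg)$statistic)
all.equal(auc, w / (length(pos) * length(neg)))
```

Read this way, the AUC is the proportion of (positive, negative) probe set pairs in which the positive control has the higher signal, which is why 1 means perfect separation and 0.5 means none.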