Question: Ask for suggestions of filtering probes in affymetrix data
0
gravatar for liux.bio
5 months ago by
liux.bio340
China
liux.bio340 wrote:

Hi, biostars.

I am analyzing an affymetrix microarray dataset. After searching, I found following blogs about filtering probes:

Following the blogs, I determined the filtering standard

  • For each sample, compute 95th percentile of the negative control genes (normgene -> intro)
  • For a gene in a sample, if the gene's value is greater than the 95th percentile of the negative control genes in this sampe, we assume this gene is detected
  • if a gene is detected in more than half of all the samples, select this gene, otherwise, filter it.

After filtering, I obtain 6543 probes (the number of all probes is 53617). I want to know if the filtering standard is reasonable. Please give me some suggestions.

Here is my code:

library(oligo)
# load data 
celFiles <- list.celfiles("./rawData/GSE83452/celFiles/", full.names = TRUE)
rawData <- read.celfiles(celFiles)
# normalize 
processedData <- rma(rawData)

# filter genes with low expression
librarypd.hugene.2.0.st)
con <- dbpd.hugene.2.0.st)
# negative-control probes (normgene -> intro)
NCprobes <- dbGetQuery(con,  "select meta_fsetid from core_mps inner join featureSet on 
 core_mps.fsetid=featureSet.fsetid where featureSet.type='10';")
#  the 95th percentile of the negative control genes 
NCquantileValue <- apply(exprs(processedData)[as.character(NCprobes[,1]),], 2, quantile, 
 probs=0.95)

# function to filter genes
filterGeneDetected <- function(sampleValues, quantileValues, detectedRatio) {
    # the length of sampleValues and that of quantileValues should be the same
    # In a sample, if a gene's is up quantileValue, we say this gene is detected
    # If a gene is detected in more than detectedRatio * allSampels, the gene is selected
    samplesLength <- length(sampleValues)
    detectedNum <- 0
    for (index in c(1:samplesLength)) {
       if (sampleValues[index] > quantileValues[index]) {
           detectedNum =  detectedNum + 1
        }  
        }
    ifSelected <- FALSE
    if (detectedNum/samplesLength >= detectedRatio) {
        ifSelected = TRUE
     }
     return (ifSelected)
   } 

detectedProbes <- apply(exprs(processedData), 1, 
                    filterGeneDetected, 
                    quantileValues = NCquantileValue,
                    detectedRatio = 0.5)
processedDataFilter <- processedData[detectedProbes, ]

Many thanks for your kindness.

affymetrix microarray • 231 views
ADD COMMENTlink modified 12 weeks ago by Biostar ♦♦ 20 • written 5 months ago by liux.bio340

That strategy for defining 'detection' seems reasonable to me, in this dataset that does not have a Affymetrix MAS Absent/Present call. And the number of surviving genes seems reasonable given your criteria. One comment about the criteria of insisting on detection in >= 50% of the samples. You may want to consider applying that filter separately within each experimental group of interest. For example, in GSE83542, say you wanted to compare NASH vs. non-NASH at baseline (n=104 vs. 44). If you had a gene that was detected in all non-NASH, but none of the NASH samples, that might be a very interesting gene, but would be filtered out by your final criterion since it was detected in only 44 of the samples (less than 50% of all samples in the dataset). If you applied your 50% detection threshold within each of the two comparison groups, and retained genes detected at 50% within any experimental group, that gene would survive your filter.

ADD REPLYlink written 12 weeks ago by Ahill1.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1151 users visited in the last hour