Got fewer DE genes after applying bioconductor genefilter for edgeR analysis?
1
1
Entering edit mode
9.1 years ago

I'm using edgeR for DE analysis of RNASeq data. I've been using manual filtering suggested in the EdgeR manual till now.

I followed genefilter package and tried to chose a theta that has the optimal number of rejections. Basically this is what I did: run edgeR once for the data set, then use the p-value results for genefilter calculations and plots.

dge <- DGEList(counts=counts[,!is.na(group)], group=group[!is.na(group)])
d <- dge
d <- calcNormFactors(d)
d <- estimateDisp(d, trend="none", robust=TRUE)
et <- exactTest(d, pair = levels(d$samples$group))
dgeTable = topTags(et, n=nrow(d))

res = data.frame(
        filterstat = rowSums(d$counts[rownames(dgeTable),]),
        pvalue = as.data.frame(dgeTable)$PValue,
        row.names = rownames(dgeTable))

rejBH = filtered_R(alpha=alpha, filter=res$filterstat, test=res$pvalue, theta=theta, method="BH")

theta_chosen = theta[max(which(rejBH==max(rejBH)))]
        use = res$filterstat > quantile(res$filterstat, probs=theta_chosen)
        dgeFilt=dge[use,]
        d <- dgeFilt

I used rowSums as filtering criteria, and the after I filter out the genes - in this case I got theta = 0.19, I ran the edgeR exact test again:

### rerun the model
d <- dgeFilt
d <- calcNormFactors(d)
d <- estimateDisp(d, trend="none", robust=TRUE)

et <- exactTest(d, pair = levels(d$samples$group))

My question is - even though the previous calculated max(rejBH) - the max number of rejections is 102 , using the same alpha (0.1) I'm only getting 60 DE genes after applying the filter. Actually, without applying any filter, the number of DE genes is 75 (this matches rejBH at 0%).

  • Why is the new DE genes number not 102?
  • Why am I getting less DE genes even though from the rejection plots it looks like I'll have the max number of rejections at theta = 0.19?

Am I doing anything wrong here or did I misunderstand something?

Thanks in advance for any suggestion!

genefilter R RNA-Seq edgeR • 3.6k views
ADD COMMENT
2
Entering edit mode
9.1 years ago

You only need to run exactTest() once and you don't need to subset d. Once you have your filter threshold, you just readjust the pvalues of the included rows in res.

BTW, the reason the numbers differ when you subset d is because you now have a different background distribution used for shrinkage.

ADD COMMENT
0
Entering edit mode

Thank you!!

ADD REPLY
0
Entering edit mode

Hi Devon, this is probably kind of a stupid question but I was wondering - wouldn't not rerunning the model kind of like "cheating" ? I guess my original thought was that we filter out the genes that do not pass the filter and just run the model as if they weren't there? My understanding was that when using manual filtering suggested in the edgeR manual, the model was fit after the genes were filtered?

Thank you!

ADD REPLY
0
Entering edit mode

That's not at all a stupid question! Firstly, have a read through the Bourgon et al. paper that layed out the theory behind all of this (they're using microarrays, but it's the same in RNAseq). The group that wrote that went on the make the genefilter Bioconductor package that DESeq2 (also written by some of those authors) uses to do this.

In short, just because we don't have enough signal from a gene to bother testing it for DE doesn't mean that it's not useful for (A) library normalization or (B) variance shrinkage. Yes, the method used in the edgeR vignette prefilters by count (actually, cpm, if I recall correctly) and then fits the remainder. If you search the bioconductor archive, you'll find at least one discussion between the edgeR and DESeq2/genefilter authors about this.

ADD REPLY
0
Entering edit mode

Thanks a lot Devon!

ADD REPLY

Login before adding your answer.

Traffic: 2957 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6