Question

Effect Of Undetected Genes On Ontology Analysis

10.2 years ago

dario.garvan ▴ 520

What is considered the best way to handle genes that are not detected at all in a two-group comparison when doing an over-representation analysis?

For example, define all genes with FPKM < 1 as not detected. I have three different two-group comparisons to make. If all undetected genes are excluded from the analysis, then different ontology categories will be excluded for each comparison, because a different set of categories will reach the minimum number of genes per category each time. The other option is to keep all genes in the analysis. This means that the ontology categories with sufficient genes will be the same for all three comparisons, but it has two undesired effects: the multiple testing adjustment is harsher because more genes are tested, and the genes with small counts will inevitably be found to not be differentially expressed. The latter seems to artificially inflate the count of genes that are not differentially expressed, because those genes might truly be differentially expressed and would be found to be so if they were covered at greater sequencing depth, for example by targeted RNA-seq. There must be some abundance threshold below which the answer to differential expression should be "don't know" rather than "no".
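To make the effect of the background choice concrete, here is a minimal sketch of the usual hypergeometric over-representation test, run once with all genes as the universe and once with only detected genes. All of the counts are made up for illustration, and `hypergeom_sf` is a hypothetical helper written for this example, not a function from any package.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from a universe of N genes,
    K of which belong to the category of interest."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Hypothetical counts: 500 DE genes, 15 of which fall in the category.
# Universe = all annotated genes (undetected genes kept as "not DE").
p_all = hypergeom_sf(k=15, N=20000, K=200, n=500)

# Universe = detected genes only; some category members were undetected.
p_detected = hypergeom_sf(k=15, N=12000, K=150, n=500)

print(f"all genes as background:      p = {p_all:.2e}")
print(f"detected genes as background: p = {p_detected:.2e}")
```

With these invented numbers the category looks more strongly enriched when undetected genes are kept in the universe, since they pad the "not DE" pool. That is the inflation described above.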

ontology • 2.1k views

updated 10.2 years ago by Devon Ryan 104k • written 10.2 years ago by dario.garvan ▴ 520

score 1 · Answer 1 · 2014-02-03

Bourgon et al. nicely showed that you can perform independent filtering to increase power in differential expression analyses (they were using microarrays, but the same holds for RNA-seq). The general idea, as you seem to be correctly alluding to, is that there's some expression level below which we simply lack sufficient power to even bother testing for significant DE. If you perform independent filtering (see the genefilter package in Bioconductor for some handy functions), then you'll end up weeding out those lowly expressed genes in a principled way, using a criterion that doesn't depend on the group labels. This should produce the better GO results you're after, with the added bonus of also yielding better DE results.
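As a rough illustration of why filtering helps, here is a small sketch in plain Python (not the genefilter API). It filters on the overall mean expression across all samples, which ignores group labels, and then applies a Benjamini-Hochberg adjustment to the genes that remain. The gene names, expression values, p-values and the threshold are all invented for the example.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical toy data: gene -> (mean expression over ALL samples, raw p-value)
genes = {
    "geneA": (50.0, 0.001),
    "geneB": (35.0, 0.012),
    "geneC": (80.0, 0.030),
    "geneD": (0.20, 0.70),
    "geneE": (0.10, 0.55),
    "geneF": (0.40, 0.90),
    "geneG": (0.30, 0.81),
    "geneH": (0.05, 0.64),
}
THRESHOLD = 1.0  # assumed cut-off, analogous to FPKM < 1 in the question

def adjusted_by_gene(names):
    padj = bh_adjust([genes[g][1] for g in names])
    return dict(zip(names, padj))

all_names = list(genes)
kept_names = [g for g in all_names if genes[g][0] >= THRESHOLD]

padj_unfiltered = adjusted_by_gene(all_names)
padj_filtered = adjusted_by_gene(kept_names)

for g in kept_names:
    print(g, round(padj_unfiltered[g], 3), round(padj_filtered[g], 3))
```

In this toy case the low-expression genes contribute only large p-values, so dropping them before adjustment lowers the adjusted p-values of the remaining genes. The filter never looks at which group a sample belongs to, which is what keeps it from biasing the test.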

BTW, have a look at the camera() and roast() functions in the limma package for gene-set testing methods that are better than the standard "just do a hypergeometric test" approach (camera() accounts for inter-gene correlation and roast() uses rotation-based tests).

One last note: you'll probably want to do the filtering once on the whole dataset, rather than performing each comparison completely separately, if you plan from the outset to make all of the pairwise comparisons between your groups (I say "probably" because I don't know the details of your experiments).
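To show what that buys you, here is a toy sketch comparing a separate detection filter per comparison with a single filter over all samples. The FPKM values, group names and the mean-FPKM cut-off are hypothetical, and the real filter statistic would depend on your pipeline.

```python
from itertools import combinations
from statistics import mean

# Hypothetical FPKM values: gene -> group -> replicate values
fpkm = {
    "geneA": {"g1": [12.0, 15.0], "g2": [11.0, 9.0], "g3": [14.0, 13.0]},
    "geneB": {"g1": [0.1, 0.2], "g2": [0.3, 0.1], "g3": [6.0, 8.0]},
    "geneC": {"g1": [0.0, 0.1], "g2": [0.2, 0.0], "g3": [0.1, 0.1]},
    "geneD": {"g1": [4.0, 5.0], "g2": [0.2, 0.1], "g3": [0.3, 0.2]},
}
MIN_MEAN_FPKM = 1.0  # assumed detection threshold
groups = ["g1", "g2", "g3"]

def detected(gene, use_groups):
    """True if the gene's mean FPKM over the given groups reaches the threshold."""
    values = [v for g in use_groups for v in fpkm[gene][g]]
    return mean(values) >= MIN_MEAN_FPKM

# Option 1: filter separately for each two-group comparison.
per_comparison = {
    pair: {gene for gene in fpkm if detected(gene, pair)}
    for pair in combinations(groups, 2)
}

# Option 2: filter once, using every sample in the dataset.
shared_universe = {gene for gene in fpkm if detected(gene, groups)}

for pair, universe in per_comparison.items():
    print(pair, sorted(universe))
print("whole dataset:", sorted(shared_universe))
```

With separate filters each comparison ends up with its own gene universe, and therefore its own set of testable ontology categories. A single filter gives all three comparisons the same universe while still dropping genes that are undetected everywhere.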