Effect Of Undetected Genes On Ontology Analysis
10.2 years ago
dario.garvan ▴ 520

What is considered the best way to handle genes that are not detected at all in a two-group comparison when doing an over-representation analysis?

For example, define all genes with FPKM < 1 as not detected. I have three different two-group comparisons to make. If all undetected genes are excluded from the analysis, then different ontology categories will be excluded for each comparison, because a different set of categories will reach the minimum number of genes per category in each comparison. The other option is to keep all genes in the analysis. This means that the ontology categories with sufficient genes in the experiment will be the same for all three comparisons, but it has two undesired effects: the multiple testing adjustment becomes heavier for all genes, and the genes with small counts will inevitably be found to be not differentially expressed. This seems to artificially inflate the count of genes that are not differentially expressed, because those genes might truly be differentially expressed and could be detected as such if more sequencing depth covered them, for example by targeted RNA-seq. There must be some abundance threshold below which the answer to differential expression should be "don't know" rather than "no".
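To make the effect of the gene universe concrete, here is a minimal sketch of a one-sided hypergeometric over-representation test run against two backgrounds. All of the counts are made up for illustration, and `hypergeom_upper_tail` is a hypothetical helper written for this post, not a function from any package.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from a universe of N genes,
    K of which belong to the ontology category."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Hypothetical counts, chosen only to illustrate the point.
de_genes = 200        # genes called differentially expressed
de_in_category = 12   # of those, genes annotated to the category

# Universe A: every annotated gene, including undetected ones.
p_all = hypergeom_upper_tail(de_in_category, N=20000, K=400, n=de_genes)

# Universe B: detected genes only. Undetected genes drop out of both
# the universe and the category.
p_detected = hypergeom_upper_tail(de_in_category, N=12000, K=300, n=de_genes)

print(f"all genes as background:      p = {p_all:.3g}")
print(f"detected genes as background: p = {p_detected:.3g}")
```

With these made-up numbers, the category looks more enriched when undetected genes are left in the background, because they are counted as "not differentially expressed" even though they could never have been called.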

10.2 years ago

Bourgon et al. nicely showed that you can perform independent filtering to increase power in differential expression analyses (they were using microarrays, but the same holds for RNA-seq). The general idea, as you seem to be correctly alluding to, is that there's some expression level below which we simply lack sufficient power to even bother testing for significant DE. "Independent" here means that the filter criterion (overall mean expression, for example) is computed without using the group labels, so that it is independent of the test statistic under the null hypothesis. If you perform independent filtering (see the genefilter package in Bioconductor for some handy functions), then you'll end up weeding out those lowly expressed genes in a meaningful way. This should produce the better GO results you're after, with the added bonus of also yielding better DE results.
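As a rough illustration of why filtering before the multiple testing adjustment helps, here is a small pure-Python sketch. It is not the genefilter API (that package is R); the function names, the cutoff of 1, and the toy numbers are all assumptions made for this example. The filter uses only the overall mean expression, and Benjamini-Hochberg is applied to the genes that survive.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def filter_then_adjust(mean_expression, pvalues, min_mean):
    """Keep genes whose overall mean reaches min_mean, then adjust
    only their p-values. Returns {gene index: adjusted p-value}."""
    keep = [i for i, mu in enumerate(mean_expression) if mu >= min_mean]
    adjusted = benjamini_hochberg([pvalues[i] for i in keep])
    return dict(zip(keep, adjusted))

# Toy data (hypothetical): overall mean expression and raw p-values.
mean_expression = [250.0, 180.0, 0.4, 0.2, 0.1, 90.0, 0.3, 0.5]
pvalues = [0.004, 0.012, 0.60, 0.85, 0.45, 0.030, 0.95, 0.70]

unfiltered = benjamini_hochberg(pvalues)
filtered = filter_then_adjust(mean_expression, pvalues, min_mean=1.0)

for gene, adj in filtered.items():
    print(f"gene {gene}: unfiltered {unfiltered[gene]:.3f}, filtered {adj:.3f}")
```

In this toy case the gene at index 5 only passes a 5% FDR cutoff once the five barely expressed genes are removed from the set of tests.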

BTW, have a look at the camera() and roast() functions in the limma package for methods for gene-set testing that are better than the standard "just do a hypergeometric test" methods.

One last note: you'll probably want to do the filtering on the whole dataset if you plan at the outset to make all pairwise comparisons between groups (I say "probably" because I don't know the details of your experiments), rather than performing each comparison completely separately.


Filtering on the whole dataset at once doesn't seem specific enough. Consider this gene's read counts:

Time 1: 100 110 115
Time 2: 5 3 0
Time 3: 0 2 1

Filtering on the whole dataset using a cutoff such as "at least 2 observations with >= 10 reads" would include this gene for all contrasts, but it's only interesting for the Time 2 - Time 1 and Time 3 - Time 1 contrasts, not the Time 3 - Time 2 contrast.
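The difference can be checked directly with the counts above. This is only a sketch: `passes_filter` is a hypothetical helper, and the cutoff is the "at least 2 observations with >= 10 reads" rule from this comment.

```python
from itertools import combinations

# Read counts for the single gene in the example above.
counts = {
    "Time 1": [100, 110, 115],
    "Time 2": [5, 3, 0],
    "Time 3": [0, 2, 1],
}

def passes_filter(values, min_reads=10, min_samples=2):
    """True if at least min_samples observations have >= min_reads reads."""
    return sum(v >= min_reads for v in values) >= min_samples

# Filter once, using every sample in the dataset.
whole_dataset = passes_filter([v for group in counts.values() for v in group])

# Filter separately for each contrast, using only its two groups.
per_contrast = {
    (a, b): passes_filter(counts[a] + counts[b])
    for a, b in combinations(counts, 2)
}

print("whole dataset:", whole_dataset)
for (a, b), kept in per_contrast.items():
    print(f"{b} - {a}: {kept}")
```

The whole-dataset filter keeps the gene, while the per-contrast filter drops it only for Time 3 - Time 2, where neither group has any real signal.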


Yeah. Most of the examples you'll see of filtering will do it on the whole dataset. You could go ahead and do it for each comparison, though (in fact, you would normally filter on the final output just prior to adjusting the p-values).
