I have run differential expression analyses of RNA-seq data for different species comparing the same tissues at two time points. I next want to look for significant overlaps across species in the lists of differentially expressed genes using hypergeometric tests. To accurately compare like-for-like genes, I've already run other analyses to determine which genes are orthologous between my species.
I can define the variables in the test like so:
- listA = the DE genes in Species A
- listB = the DE genes in Species B
- Overlap = the overlap between listA and listB
Then, I run the analysis in R like so:
phyper(length(Overlap) - 1, length(listB), length(Background) - length(listB), length(listA), lower.tail = FALSE)
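To make the call concrete, here is a minimal sketch on made-up data. The gene IDs, list sizes and the name `Background` contents are all invented for illustration. It also checks the result against a one-sided Fisher's exact test on the corresponding 2x2 table, which should give the same p-value.

```r
# Toy orthologue IDs; every value here is made up for illustration.
Background <- paste0("OG", 1:1000)
listA <- Background[1:100]                # DE in Species A
listB <- Background[c(81:100, 201:280)]   # DE in Species B
Overlap <- intersect(listA, listB)

# P(X >= length(Overlap)) under the hypergeometric null
p_hyper <- phyper(length(Overlap) - 1,
                  length(listB),
                  length(Background) - length(listB),
                  length(listA),
                  lower.tail = FALSE)

# The same test expressed as a one-sided Fisher's exact test
in_A <- Background %in% listA
in_B <- Background %in% listB
p_fisher <- fisher.test(table(in_A, in_B), alternative = "greater")$p.value
```

Subtracting 1 from the overlap matters because `phyper(q, ..., lower.tail = FALSE)` returns P(X > q), so `q = length(Overlap) - 1` gives the probability of an overlap at least as large as the one observed.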
The problem I'm having is defining the background set of genes to use in the test. Possibilities include:
- The union of all genes detected as expressed in either species. For this purpose, I'd take the genes that pass DESeq2's independent filtering. Note: this does not keep only DE genes; it keeps every gene that survives the independent filtering DESeq2 applies before multiple-testing adjustment, whether or not that gene is called DE.
- The intersection of all genes detected as expressed in both species (as per the DESeq2 filtering rationale above). The reasoning for this approach is that the intersection keeps only those genes that are capable of being expressed in both species, i.e. the shared orthologous genes.
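The two candidate backgrounds can be sketched as follows. This assumes the expressed genes of each species have already been mapped to shared orthologue IDs. The vectors `exprA` and `exprB`, the helper `overlap_p` and all the numbers are hypothetical. The toy DE lists are chosen to lie inside the intersection so that both backgrounds are valid for them.

```r
# Hypothetical expressed-gene sets, already mapped to shared orthologue IDs.
exprA <- paste0("OG", 1:800)      # expressed in Species A
exprB <- paste0("OG", 301:1000)   # expressed in Species B

bg_union     <- union(exprA, exprB)       # expressed in either species
bg_intersect <- intersect(exprA, exprB)   # expressed in both species

# Toy DE lists that both lie inside the intersection.
listA <- paste0("OG", 301:400)
listB <- paste0("OG", 381:480)
Overlap <- intersect(listA, listB)

# Hypothetical helper: same test, background supplied as an argument.
overlap_p <- function(bg) {
  phyper(length(Overlap) - 1,
         length(listB),
         length(bg) - length(listB),
         length(listA),
         lower.tail = FALSE)
}

p_union     <- overlap_p(bg_union)
p_intersect <- overlap_p(bg_intersect)
```

With the lists held fixed, the larger union background lowers the expected overlap and so yields the smaller p-value, which is why the choice of background can change the conclusion.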
The latter approach seems preferable to me because a background list should encompass the genes capable of being expressed in both species. If the intersection is the more appropriate background, then I think "listA" and "listB" should also be pruned to include only genes that have orthologues in both species: a gene that is not detected in the genomes (transcriptomes) of both species cannot appear in both DE lists, so it can never contribute to the overlap.
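A minimal sketch of that pruning step, assuming the intersection background and both DE lists are character vectors of gene IDs. The IDs, including the species-specific ones such as `A_only_1`, are made up.

```r
# Hypothetical intersection background and unpruned DE lists.
Background <- paste0("OG", 1:500)
listA <- c(paste0("OG", 1:60), "A_only_1", "A_only_2")   # two genes lack an orthologue
listB <- c(paste0("OG", 41:120), "B_only_1")             # one gene lacks an orthologue

# Keep only DE genes that are present in the background.
listA_pruned <- intersect(listA, Background)
listB_pruned <- intersect(listB, Background)
Overlap <- intersect(listA_pruned, listB_pruned)

# The hypergeometric test assumes both lists are subsets of the background.
stopifnot(all(listA_pruned %in% Background),
          all(listB_pruned %in% Background))

p_pruned <- phyper(length(Overlap) - 1,
                   length(listB_pruned),
                   length(Background) - length(listB_pruned),
                   length(listA_pruned),
                   lower.tail = FALSE)
```

Without pruning, `length(listA)` and `length(listB)` would count genes that lie outside the background, so the counts passed to `phyper` would no longer describe draws from the same urn.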
However, I'd greatly appreciate the input of others into thinking about this issue!