goverlap is a modern ontology term enrichment analyzer.
While there exist many similar tools - such as the R library GOstats or the immensely popular online tool DAVID - goverlap contains an integrated package not found elsewhere. It
- supports two input lists to find whether there are significant differences in term enrichment between experiments
- contains several ontologies, and can easily be extended with more
- has a command line interface, which makes it trivial to analyze and compare any number of files
- does not require any setup or manual downloads
- is open-source
goverlap is a command line reimplementation of the now defunct web tool
GeneTools) described in the paper GeneTools – application for functional annotation and statistical hypothesis testing.
The result of biological studies is often a list of genes that are found to be differentially expressed in the experimental condition. But to understand your results, a list of non-sensical gene names is of little help. You want to know what happened in the experiment.
Luckily, for most genes, several of their biological functions are known. This knowledge exists in annotations that map gene names to functions.
This makes it possible to turn a list of genes (your results) into the biological functions that were affected in the experimental condition.
In short, you go from gobbeldygook like this:
Tubb2b Tubb5 Crmp1 # several hundred more gene names...
to something a biologist can understand, like
Neuroactive ligand-receptor interaction or
Calcium signaling pathway.
There are tens of thousands of categories of biological functions and processes, written down in classification schemes called ontologies. Three randomly drawn examples of such categories are:
Generation of a long process of a CNS neuron,
amine transport and
antigen receptor-mediated signaling pathway
These all have known sets of genes associated with them.
For each such category, goverlap finds the number of genes
n in your input gene list that is represented in the list of genes associated with that ontological category. Then it computes the probability that
n genes from your gene list would be found associated with that ontological category by chance. If this probability is low, your genes are likely to be involved in whatever function that ontology category represents.
How goverlap tells you this is shown in the example output below:
go_root ontology a_hpgm_go_fdr a_odds a a_I_go go_subtree U a_hpgm_go go_annotation GO:0005515 MF 2.71e-55 1.05 1281 808 9038 21941 1.8e-59 protein binding GO:0005488 MF 1.61e-52 1.78 1281 998 12610 21941 2.2e-56 binding ...
We see the corrected p-value (third column) is very low for both the shown categories. This means that your input genes are very likely to be involved in
protein binding and the
binding categories (Note that the actual list would contain thousands of more rows, one for each ontology category.)
An explanation of each column in the output:
go_rootis the category for which we are investigating possible over- or under-representation
ontologyis the classification scheme in which the category exists
a_hpgm_go_fdris the value
a_hpgm_gocorrected for multiple testing with
a_oddsis the odds ratio which tells the degree and magnitude of over- or under-representation
ais the number of genes in the input gene list
a_I_gois the number of genes both in the input and associated with the category
go_subtreeis the number of genes associated with the category
Uis the total number of all genes used in the experiment
a_hpgm_gois the p-value for the probability of the input genes being overrepresented in the category, computed with the hypergeometric test
go_annotationis a description of the category
Two input files
goverlap accepts two input gene lists. In this case it computes the statistics listed above for each input file. In addition, it also computes the probability that there is a statistically significant difference in the degree the gene lists overlap with each ontology category.
The two-list output looks like this:
go_root ontology chi_sq_fdr a_hpgm_go_fdr a_odds b_hpgm_go_fdr b_odds a_I_b_I_go a_M_b_I_go b_M_a_I_go go_subtree U a_I_b a_M_b b_M_a chi_sq test_obs b_hpgm_go a_hpgm_go go_annotation GO:0034645 BP 1.4e-08 8.7e-49 0.29 8.2e-09 0.3 104 744 53 3645 21941 4238 2957 277 6.1e-11 42.79 1.1e-10 4.91e-51 cellular macromolecule biosynthetic process GO:0048869 BP 3.5e-08 3.39e-84 0.28 3.49e-22 0.29 105 739 67 3169 21941 4238 2957 277 1.6e-10 40.9 7.2e-25 6.8e-87 cellular developmental process ...
An explanation of each column in the output that does not exist in the single list case:
chi_sq_fdris the value
chi_sqcorrected for multiple testing
a_I_b_I_gois the number of genes in the intersection of gene list a and b and the ontology category
a_M_b_I_gois the number of genes in the intersection of gene list a and the ontology category, but not in gene list b
b_M_a_I_gois the number of genes in the intersection of gene list b and the ontology category, but not in gene list a
a_M_bis the number of genes in gene list a, but not in gene list b
b_M_ais the number of genes in gene list b, but not in gene list a
chi_sqis the probability that genes from the category are equally likely to
come from list A or B
test_obsis the test statistic
Note that the statistical test used for two-list comparisons is different, depending on their set-theoretical relationship. There are four different cases:
The lists are overlapping
In this case the Leisering, Alonzo and Pepe test (LAP) is used, since Fisher's exact is inappropriate when the categories are not mutually exclusive.
The lists are mututally exclusive
In this case Fisher's Exact is used.
One list is a proper subset of the other
This is an exceedingly rare case, where Fisher's exact can be used (but this is not implemented yet.) One list being a proper subset of another is expected only to happen in artificial situations, where the lists represent the same results, only with different p-value cutoffs.
When Fisher's Exact test is used, the columns
test_obs are replaced with
The lists are exactly equal
In the case of equal lists, no comparison is done and goverlap executes as if only one list was input.
$ pip install goverlap
Command line interface
goverlap goverlap is a modern ontology term enrichment analyzer (Visit github.com/endrebak/goverlap for examples, docs and help.) Usage: goverlap --genes-a=A [--genes-b=B] [--experiment=EXP] [--species=SPE] [--ontologies=ONT] [--limit=LIM] [--ncpus=CPU] [--verbosity=VRB] goverlap --help Arguments: -a A --genes-a=A newline-separated list of the genes in experiment A Options: -h --help show this message -b B --genes-b=B newline-separated list of the genes in experiment B -s SPE --species=SPE species to do analysis on (human, mouse or rat) [default: human] -o ONT --ontologies=ONT comma-separated list of ontologies [default: CC,MF,BP,KEGG] -l LIM --limit=LIM remove categories where the number of genes are more than LIM percent of all genes [default: 0.25] -x EXP --experiment=EXP newline-separated list of all the genes in the experiment -v VRB --verbosity VRB options: quiet, info, debug [default: info] -n CPU --ncpus=CPU number of cores used [default: 1] Note: The gene names need to be of the kind called "external_gene_name"/"associated gene name" in BioMart. If they are in any other format, you need to use a BioMart id converter to convert them to this format. (see BioMart.org or github.com/endrebak/biomartian)
# input files (these can be found in the `examples` directory): $ head -3 examples/Simes_Binner_ECIIvsDeep.txt Ablim2 Elp5 Vmac $ head -3 examples/FeatureCount_ECIIvsDeep.txt Slc12a4 Lcat Tle4 $ goverlap -a examples/FeatureCount_ECIIvsDeep.txt \ # first input list -b examples/Simes_Binner_ECIIvsDeep.txt \ # second input list -s rat \ # the species name; used for gene and ontology lookups -v debug \ # how much output about the running process to print to stderr -n 30 -o CC,KEGG \ # which ontologies to use -l 0.10 \ # remove any ontology category where the number of associated genes is \ # more than 10% of the total number of genes. -x examples/experiment_genes.txt # restrict the genes considered to this list
This produces the output
# /local/home/endrebak/anaconda/bin/goverlap -a examples/FeatureCount_ECIIvsDeep.txt -b examples/Simes_Binner_ECIIvsDeep.txt -s rat -v debug -n 30 -o CC -l 0.1 -x examples/experiment_genes.txt go_root ontology chi_sq_fdr a_hpgm_go_fdr a_odds b_hpgm_go_fdr b_odds a_I_b_I_go a_M_b_I_go b_M_a_I_go go_subtree U a_I_b a_M_b b_M_a chi_sq test_obs b_hpgm_go a_hpgm_go go_annotation GO:0030425 CC 1.0 0.001 0.03 4.38e-11 0.05 184 84 7 369 15945 6170 4073 131 0.8 0.06 4.18e-13 4.04e-05 dendrite GO:0000813 CC 1.0 1.0 0.0001 1.0 0.0 1 1 0 8 15945 6170 4073 131 0.9 5.3e-08 1.0 0.9 ESCRT I complex GO:0034098 CC 1.0 1.0 0.0003 1.0 0.0004 2 1 0 6 1594 6170 4073 131 0.99 1.05e-07 0.57 0.7 VCP-NPL4-UFD1 AAA ATPase complex GO:0034666 CC 1.0 0.7 0.0003 0.16 0.0009 2 0 0 2 15945 6170 4073 131 1.0 0.0 0.02 0.14 integrin alpha2-beta1 complex GO:0033256 CC 1.0 0.15 0.001 0.34 0.001 6 3 0 7 15945 6170 4073 131 0.9 2.8e-06 0.05 0.01 I-kappaB/NF-kappaB complex GO:0000308 CC 1.0 1.0 0.0002 0.9 0.0005 1 0 0 2 15945 6170 4073 131 1.0 0.0 0.2 0.6 cytoplasmic cyclin-dependent protein kinase holoenzyme complex GO:0044308 CC 1.0 1.0 0.0002 1.0 0.0 1 1 0 2 15945 6170 4073 131 0.9 5.2e-08 1.0 0.6 axonal spine GO:0032591 CC 1.0 0.3 0.0008 0.3 0.001 5 2 0 6 15945 6170 4073 131 0.9 1.05e-06 0.03 0.03 dendritic spine membrane ...
Tested on both
All the required libraries are installed automatically when fetching goverlap from PyPI.
pip install goverlap # also installs pandas, numpy/scipy, joblib, godb, biomartian, kg, ebs
To run the tests you also need
sh tests/run_unit_tests.sh and
Report any failing tests on the issues page.
Papers or studies using goverlap
Several high-profile papers to be added (will probably take years, though.)