I have a metagenomics dataset with de-novo open reading frames [ORF] calls analogous to genes. In this dataset, I have metagenomic assembled genomes [MAGs] and for each MAG I have clusters of ORFs. I have ran KOFAMSCAN to get KEGG orthologs for the ORFs.

I want ask the following statistical question:

**For each MAG-specific cluster, which KEGG pathways are enriched?**

I've already extracted all of info from this KEGG htext file so I have a gmt formatted table that basically has the following structure: `[KEGG_ID]\t[KEGG_PATHWAY_DESCRIPTION]\t[KO_1]\t[KO_2]\t...[KO_N]`

(though, I also have this in a python dictionary `{"kegg_pathway":{KO_1, KO_2, ..., KO_N}`

).

Additionally, I have a vector of positive floats that I could also use for each of the KO specific to each cluster.

My first idea was to try GSEA via GSEApy using `prerank`

module. I got this to run but realized it wants positive and negative weights so I don't think I'm using this appropriately. It seems to be specific to phenotypic data and I only have one condition.

My next idea was to try to use a hypergeometric test, though I'm not sure how this can be done exactly: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.hypergeom.html

For example if I went down this path, would I do this for multiple iterations and what would I use for the `M`

, `n`

, and `N`

**Can someone direct me to a software package (preferably Python or command line but R is fine if the implementation is straightforward), how to use the hypergeometric test properly, or a better way to go about solving this problem?**

A recap of my dataset:

KEGG_DATABASE

- Mapping between KEGG pathways to a list of KEGG orthologs in said pathway (GMT file and Python dictionary)

METAGENOME

- MAGs
- Cluster of ORFs in MAG
- KEGG ortholog mapping for ORFs in cluster
- Vector of positive floats with "weight" for KEGG ortholog w.r.t. each MAG or Cluster (the details aren't important)

- MAGs

Thanks for the response. Expanding this out a bit. We have the following:

`MAG_1`

has 1000 genes`gene_set_A`

has 50 genes`query_cluster`

has 20 genes`query_cluster`

are in`gene_set_A`

In the example here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.hypergeom.html

Would the way to generalize this to our problem be the following:

The survival function

`sf`

might be more straight forward in this case, since it will calculate the probability of k or more successes, which is identical to the p-value of the upper-tail.