How can I run "gene set enrichment" (not necessarily GSEA) with only one condition?
Entering edit mode
8 months ago
O.rka ▴ 390

I have a metagenomics dataset with de-novo open reading frames [ORF] calls analogous to genes. In this dataset, I have metagenomic assembled genomes [MAGs] and for each MAG I have clusters of ORFs. I have ran KOFAMSCAN to get KEGG orthologs for the ORFs.

I want ask the following statistical question:

For each MAG-specific cluster, which KEGG pathways are enriched?

I've already extracted all of info from this KEGG htext file so I have a gmt formatted table that basically has the following structure: [KEGG_ID]\t[KEGG_PATHWAY_DESCRIPTION]\t[KO_1]\t[KO_2]\t...[KO_N] (though, I also have this in a python dictionary {"kegg_pathway":{KO_1, KO_2, ..., KO_N}).

Additionally, I have a vector of positive floats that I could also use for each of the KO specific to each cluster.

My first idea was to try GSEA via GSEApy using prerank module. I got this to run but realized it wants positive and negative weights so I don't think I'm using this appropriately. It seems to be specific to phenotypic data and I only have one condition.

My next idea was to try to use a hypergeometric test, though I'm not sure how this can be done exactly:

For example if I went down this path, would I do this for multiple iterations and what would I use for the M, n, and N

Can someone direct me to a software package (preferably Python or command line but R is fine if the implementation is straightforward), how to use the hypergeometric test properly, or a better way to go about solving this problem?

A recap of my dataset:


    • Mapping between KEGG pathways to a list of KEGG orthologs in said pathway (GMT file and Python dictionary)

    • MAGs
      • Cluster of ORFs in MAG
      • KEGG ortholog mapping for ORFs in cluster
      • Vector of positive floats with "weight" for KEGG ortholog w.r.t. each MAG or Cluster (the details aren't important)
metagenomics gene set enrichment analysis GSEA • 452 views
Entering edit mode
8 months ago

Let's say that a gene term contains 50 genes from a total pool of 1000 genes, and of the 20 genes in your cluster, 5 of them are in that gene term. In scipy this would be: n = 50, M = 1000, N = 20, k=5. Remember that to derive a p-value you need to get the probability of getting that number or more extreme number of successes given the null is true.

Entering edit mode

Thanks for the response. Expanding this out a bit. We have the following:

  • MAG_1 has 1000 genes
  • gene_set_Ahas 50 genes
  • The query_cluster has 20 genes
  • 5 genes in query_cluster are in gene_set_A

In the example here:

[M, n, N] = [20, 7, 12]
rv = hypergeom(M, n, N)
x = np.arange(0, n+1)
pmf_dogs = rv.pmf(x)

Would the way to generalize this to our problem be the following:

n = 50
M = 1000
N = 20

hg = stats.hypergeom(M=M, n=n, N=N)
p_value = hg.pmf(k)
p_value # 0.0019786609386289876
Entering edit mode

The survival function sf might be more straight forward in this case, since it will calculate the probability of k or more successes, which is identical to the p-value of the upper-tail.


Login before adding your answer.

Traffic: 1578 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6