Question

R code for gene set enrichment analysis

0

Entering edit mode

2.0 years ago

j.jacob1 • 0

Hi,

I'm doing an online Omics data analysis course. I don't understand how specific parts of the code chunks below work, and I'd really appreciate someone helping me to understand this code, as plainly as possible. The aim is to compute a null distribution of gene set enrichment (GSEA) scores. To do this the samples are permuted, the statistical analysis is repeated, the genes are re-ordered according to the new statistics, and a new GSEA score is computed.

packages needed:

library(tidyverse)
library(DESeq2)
library(GO.db)
library(org.Hs.eg.db)

Choose the gene set of interest (this part is fine, no explanation needed):

GO_0005925 <- AnnotationDbi::select(org.Hs.eg.db,
                                    keys = "GO:0005925",
                                    columns = c("ENTREZID", "SYMBOL"),
                                    keytype = "GO") %>%
    as_tibble %>%
    filter(!duplicated(ENTREZID)

I think I know what the next code chunk is doing. x is the DESeq object. x$Condition is the colData column which specifies the experimental conditions - either 'mock' transfection (3 samples) or the same genetic perturbation (3 samples). There are 6 samples in total.

##' A function that assigns the permuted Conditions,
##' runs DESeq and a GSEA analysis
##'
##' @param x a DESeq object.
##' @param perm `character()` of length `ncol(x)` indicating the
##'     permutation to be tested.
##'
##' @return `numeric()` containing the GSEA path.
gsea_perm <- function(x, perm) {
    ## Set the Condition based on the permutation
    x$Condition[perm] <- "mock"
    x$Condition[-perm] <- "KD"
    ## Run DESeq2, extract and annotate results
    suppressMessages(
        tbl <- DESeq(x) %>%
            results(name = "Condition_KD_vs_mock") %>%
            as_tibble(rownames = "ENSEMBL") %>%
            left_join(ensembl_to_geneName) %>%
            mutate(inGO = ENTREZID %in% GO_0005925$ENTREZID) %>%
            filter(!is.na(ENTREZID),
                   !is.na(padj),
                   !duplicated(ENTREZID)) %>%
            dplyr::select(ENSEMBL, ENTREZID, padj, inGO) %>%
            arrange(padj)
    )
    ## Compute GSEA results
    tbl <- tbl %>%
        mutate(score = if_else(inGO, set_ratio(tbl), -1)) %>%
        mutate(score = cumsum(score))
    return(tbl$score)
}

Because there are 6 samples, they are permuted like so. Only the first ten (out of twenty) permutations are subsequently used:

(perms <- combn(6, 3))

This next code chunk is the one I'm struggling to understand. dds is a DESeq object. What is p and why function(p) in the code chunk? The function is only run on the first ten permutations. Scores are then calculated:

paths <- apply(perms[, 1:10], 2, function(p) gsea_perm(dds, p))
scores <- sapply(paths, max)

A plot is then made comparing the actual score to the permutation score. How is the first element known to be the actual score? I think that is because the first permutation in combn(6, 3) above is '1, 2, 3' which happens to be the correct experimental design (samples 1, 2 and 3 are the 'mock' transfection), but I could be wrong.

plot(density(scores))
rug(scores)
abline(v = scores[1])

Apologies for the lengthy question but I would really be grateful for some help with the specific queries I raised. I know I can use ClusterProfiler for the same ends but I am keen not to gloss over this R code without understanding it thoroughly.

r enrichment rna-seq statistics gsea • 613 views

ADD COMMENT • link 2.0 years ago by j.jacob1 • 0