R code for gene set enrichment analysis
0
0
Entering edit mode
2.0 years ago
j.jacob1 • 0

Hi,

I'm doing an online Omics data analysis course. I don't understand how specific parts of the code chunks below work, and I'd really appreciate someone helping me to understand this code, as plainly as possible. The aim is to compute a null distribution of gene set enrichment (GSEA) scores. To do this the samples are permuted, the statistical analysis is repeated, the genes are re-ordered according to the new statistics, and a new GSEA score is computed.

packages needed:

library(tidyverse)
library(DESeq2)
library(GO.db)
library(org.Hs.eg.db)

Choose the gene set of interest (this part is fine, no explanation needed):

GO_0005925 <- AnnotationDbi::select(org.Hs.eg.db,
                                    keys = "GO:0005925",
                                    columns = c("ENTREZID", "SYMBOL"),
                                    keytype = "GO") %>%
    as_tibble %>%
    filter(!duplicated(ENTREZID)

I think I know what the next code chunk is doing. x is the DESeq object. x$Condition is the colData column which specifies the experimental conditions - either 'mock' transfection (3 samples) or the same genetic perturbation (3 samples). There are 6 samples in total.

##' A function that assigns the permuted Conditions,
##' runs DESeq and a GSEA analysis
##'
##' @param x a DESeq object.
##' @param perm `character()` of length `ncol(x)` indicating the
##'     permutation to be tested.
##'
##' @return `numeric()` containing the GSEA path.
gsea_perm <- function(x, perm) {
    ## Set the Condition based on the permutation
    x$Condition[perm] <- "mock"
    x$Condition[-perm] <- "KD"
    ## Run DESeq2, extract and annotate results
    suppressMessages(
        tbl <- DESeq(x) %>%
            results(name = "Condition_KD_vs_mock") %>%
            as_tibble(rownames = "ENSEMBL") %>%
            left_join(ensembl_to_geneName) %>%
            mutate(inGO = ENTREZID %in% GO_0005925$ENTREZID) %>%
            filter(!is.na(ENTREZID),
                   !is.na(padj),
                   !duplicated(ENTREZID)) %>%
            dplyr::select(ENSEMBL, ENTREZID, padj, inGO) %>%
            arrange(padj)
    )
    ## Compute GSEA results
    tbl <- tbl %>%
        mutate(score = if_else(inGO, set_ratio(tbl), -1)) %>%
        mutate(score = cumsum(score))
    return(tbl$score)
}

Because there are 6 samples, they are permuted like so. Only the first ten (out of twenty) permutations are subsequently used:

(perms <- combn(6, 3))

This next code chunk is the one I'm struggling to understand. dds is a DESeq object. What is p and why function(p) in the code chunk? The function is only run on the first ten permutations. Scores are then calculated:

paths <- apply(perms[, 1:10], 2, function(p) gsea_perm(dds, p))
scores <- sapply(paths, max)

A plot is then made comparing the actual score to the permutation score. How is the first element known to be the actual score? I think that is because the first permutation in combn(6, 3) above is '1, 2, 3' which happens to be the correct experimental design (samples 1, 2 and 3 are the 'mock' transfection), but I could be wrong.

plot(density(scores))
rug(scores)
abline(v = scores[1])

Apologies for the lengthy question but I would really be grateful for some help with the specific queries I raised. I know I can use ClusterProfiler for the same ends but I am keen not to gloss over this R code without understanding it thoroughly.

r enrichment rna-seq statistics gsea • 613 views
ADD COMMENT

Login before adding your answer.

Traffic: 2493 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6