There are a lot of R-based pathway analysis tools, along with supporting data packages that provide the actual pathways from GO, KEGG, or Reactome. However, support within the R ecosystem for the Molecular Signatures Database (MSigDB) from the GSEA project is fairly limited. You have to import GMT files, restructure the resulting objects, and potentially convert genes from human to other species. Each of these steps is relatively trivial, but they add up. As Hadley Wickham said: "you should consider writing a function whenever you've copied and pasted a block of code more than twice". Functions are easy to share, but datasets are trickier. So I made an R package that includes both: msigdbr (on CRAN and GitHub).
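For context, the manual import-and-restructure steps mentioned above might look roughly like the base R sketch below. It is not code from msigdbr. It assumes the usual GMT layout (gene set name, description, then gene identifiers, all tab-separated), and the function name `parse_gmt` and the two inline gene sets are made up for illustration.

```r
# Parse GMT-formatted lines into a long data frame with one gene per row.
# Assumption: each line is "set name <TAB> description <TAB> gene1 <TAB> gene2 ...".
parse_gmt <- function(gmt_lines) {
  fields <- strsplit(gmt_lines, "\t", fixed = TRUE)
  rows <- lapply(fields, function(f) {
    genes <- f[-(1:2)]  # drop the set name and the description
    data.frame(
      gs_name = rep(f[1], length(genes)),
      gene_symbol = genes,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Two hypothetical gene sets, inlined so the example needs no external file.
# With a real file you would pass readLines() output for that file instead.
example_gmt <- c(
  "SET_A\tdescription A\tTP53\tBRCA1\tEGFR",
  "SET_B\tdescription B\tMYC\tTP53"
)

gene_sets_df <- parse_gmt(example_gmt)

# Many enrichment tools expect a named list of gene vectors instead.
gene_sets_list <- split(gene_sets_df$gene_symbol, gene_sets_df$gs_name)
```

This covers only the parsing and reshaping. Converting human genes to orthologs in another species is a separate step that this sketch does not attempt.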
With msigdbr, you can retrieve MSigDB gene sets:
- in an R-friendly format (a "tidy" data frame with one gene per row that works well with the tidyverse packages)
- as both gene symbols and Entrez Gene IDs
- for multiple frequently studied organisms (not everyone works exclusively with human data, and it's easy to run into problems retrieving gene orthologs)
- that can be used and shared in a single script (without requiring additional files or an active internet connection)
There is a vignette available with more info and usage examples.
A few similar solutions already exist, but I couldn't find any that addressed all of my pain points. I also just wanted to make an R package, and this seemed like a good idea that was simple enough to start with. This probably doesn't need to be stated explicitly, but any feedback is welcome, which is why I'm posting here.