Tool: MSigDB for Multiple Organisms in a Tidy Data Format
gravatar for igor
2.5 years ago by
United States
igor11k wrote:

There are a lot of R-based pathway analysis tools. There are also supporting data packages for the actual pathways from GO, KEGG, or Reactome. However, support for Molecular Signatures Database (MSigDB) from the GSEA within the R ecosystem is fairly limited. You have to import GMT files, re-structure the resulting objects, and potentially convert genes from human to other species. All of these are relatively trivial, but it adds up. As Hadley Wickham said: "you should consider writing a function whenever you've copied and pasted a block of code more than twice". Functions are easy to share, but datasets are trickier. So I made an R package that includes both: msigdbr (on CRAN and GitHub).

With msigdbr, you can retrieve MSigDB gene sets:

  • in an R-friendly format (a "tidy" data frame with one gene per row that work well with the tidyverse packages)
  • as both gene symbols and Entrez Gene IDs
  • for multiple frequently studied organisms (not everyone works with exclusively human data and it's easy run into problems retrieving gene orthologs)
  • that can be used and shared in a single script (without requiring additional files or an active internet connection)

There is a vignette available with more info and usage examples.

There are a few other similar existing solutions, but I couldn't find any that addressed all of my pain points. I also just wanted to make an R package and this seemed like a good idea that was simple enough to start with. This probably doesn't need to be explicitly stated, but any feedback is welcome, which is why it's good to post here.

R gsea tool pathways msigdb • 1.7k views
ADD COMMENTlink modified 9 months ago • written 2.5 years ago by igor11k

Hey Igor great package!

I was hoping to use it for some fly analysis but had two quick questions about this if you have time; in the vignette you have:

'Can’t I just convert human genes to any organism myself? Yes. A popular method is using the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be helpful.'

Could you maybe clarify a little more how you are making these connections in your package?

I see at the bottom:

'Gene homologs are provided by HUGO Gene Nomenclature Committee at the European Bioinformatics Institute which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam and ZFIN. For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.'

So that suggests to me that you are taking the human genes and finding their most likely orthologs (in each given species) based on the overlap between multiple databases. What happens in cases where the closest homolog isn't that "great" of a match (even with overlaps across different DBs)? Is there any specific scoring threshold (AA similarity etc...) or is it just based on database annotations?

Just as an example looking at the major ping-pong piRNA Drosophila protein genes we have: AGO3 and aub

The homologs for humans from the package are:


aub: PIWIL1

The current literature seems to suggest that AGO3 is slightly more similar PIWIL2. Obviously I wouldn't expect someone to go in and manually check each entry - furthermore for computational reproducibility I think it makes sense to keep the standard association methods universal. So please don't take this as criticism - just wondering how this would affect analysis.

Also how has this package been working for you? Do researchers/collaborators seem to agree with the results in general?

EDIT: Also quick suggestion. It might be nice to include the human-readable "brief description" for each of the annotation sets. Maybe an extra $description column in the m_df object.


CSR_EARLY_UP.V1_UP == Genes up-regulated in early serum response of CRL 2091 cells (foreskin fibroblasts).

It was pretty easy for me to do myself by downloading the card from msigdb but just thought it might be good to already include in a "all-in-one" package such as this one - especially if less "computationally experienced" people want to use it.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by benformatics1.9k

Thanks for trying the package!

I only keep the homologs that are listed in multiple databases and keep the one that is confirmed by most (some genes can return dozens of homologs across different resources). I don't expect every gene to be correct, but using gene sets should hopefully balance that out.

I work mostly with mammalian genomes, so the homologs are usually very obvious and the pathways should be similar. I can't really comment on more distant species. For some context, some of the MSigDB gene sets are based on zebra fish data, so cross-species comparisons are acceptable to some degree.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by igor11k

Hi Igor and All, thank you for developing this tool! I am able to access 2 MSigDB gene sets with the following:

m_df.c2.c3 = msigdbr(species = "Mus musculus") %>% dplyr::filter(gs_cat == "C3" | 
                                                            gs_cat == "C2")

This gives me C2 and C3 MSigDBs in one object. I'm interested in using these two gene set collections for GSVA analysis (Hanzelmann et al 2013 and Hanzelmann et al 2019). This package takes as input either GeneSetCollection from GSEAbase or a list object "with names identifying gene sets and each entry of the list containing the gene identifiers of the genes forming the corresponding set." My question is if there's a way to generate a list() object from m_df.c2.c3 here such that it's a list of gene sets with EntrezIDs in each gene set? Or if it's possible to go directly from m_df.c2.c3 to a GeneSetCollection object?

Thank you in advance for your time!

ADD REPLYlink written 9 months ago by TJ40

I prefer to use a standard list as opposed to a more specialized object GeneSetCollection. You can create a list with: split(x = m_df.c2.c3$entrez_gene, f = m_df.c2.c3$gs_name) if you want to use Entrez IDs. You can also use gene symbols. I clarify this because after seeing hundreds of expression datasets, I have never seen any with Entrez IDs. They only seem to come up in the context of pathways analysis, making many tutorials unnecessarily complicated, forcing users to convert to and/or from Entrez IDs.

If you are wondering, there is actually a example in the vignette (for running fgsea):

ADD REPLYlink modified 9 months ago • written 9 months ago by igor11k

This worked, thank you! I'm using Entrez IDs for pathway analysis as you alluded to. My initial expression matrix is a cells x features matrix with features as gene symbols. However, the gsva package recommends Entrez IDs, so I converted the gene symbols to Entrez IDs to keep it all consistent. I'm going to take the list of gene sets from here to generate a cells x gene sets matrix for further down stream analysis (clustering, etc...). Thanks again for the speedy reply!

ADD REPLYlink written 9 months ago by TJ40

Happy to hear it worked!

Does gsva recommend Entrez IDs or do they just use them in their examples (possibly for historical reasons)?

Every time you convert your genes from one format to another, you lose some information. If you already have gene symbols, you may as well keep everything as gene symbols if you can.

ADD REPLYlink modified 9 months ago • written 9 months ago by igor11k

Thanks for pointing that out. I actually did lose some genes when I went from symbols to Entrez ID. I just went back to read the paper. Initially, I thought that gsva only uses Entrez ID and leaves it up to the user to make the conversion, unless geneset is provided as a GeneSetCollection, in which case it does the transformation internally. But in re-reading (quote from Hanzelmann et al 2019):

When the two main arguments are an ExpressionSet object and a GeneSetCollection object, the gsva() function will first translate the gene identifiers used in the GeneSetCollection object into the corresponding feature identifiers of the ExpressionSet object, according to its corresponding annotation package. This translation is carried out by an internal call to the function mapIdentifiers() from the GSEABase package. This means that both input arguments may specify features with different types of identifiers, such as Entrez IDs and probeset IDs, and the GSEABase package will take care of mapping them to each other. A second filtering step is applied that removes genes without matching features in the ExpressionSet object. If the expression data is given as a matrix object then only the latter filtering step will be applied by the gsva() function and, therefore, it will be the responsibility of the user to have the same type of identifiers in both the expression data and the gene sets.

Now that I found msigdbr, I will go back to leave my ExpressionSet in gene symbol format and use mouse gene symbols from msigdbr to see if gsva() will handle it. Thanks again!

ADD REPLYlink written 9 months ago by TJ40

Just to follow up: I was able to keep all of my genes without having to convert from gene symbol to Entrez ID. Kept everything in gene symbol format and gsva() ran without issues. Thanks again!

ADD REPLYlink written 9 months ago by TJ40
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1181 users visited in the last hour