Question: Current sources for pathway data/gene sets
5.7 years ago by
Lou10 wrote:

Hi All, 

Does anyone know of a good resource for getting current KEGG, BioCarta and Reactome pathway/gene set data in .gmt format? I have used the MSigDB C2 gene sets in the past, but these are now quite outdated. Any advice would be much appreciated.


modified 5.7 years ago by B. Arman Aksoy1.2k • written 5.7 years ago by Lou10
5.7 years ago by
B. Arman Aksoy1.2k
New York, NY
B. Arman Aksoy1.2k wrote:

It doesn't have KEGG and BioCarta, but you can download the GSEA files for various data sources from the latest Pathway Commons 2 web service:

These files contain UniProt IDs, though. You might need to map these back to some other type of ID depending on your needs, but PC2 also has those mappings for you:
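As a rough illustration of the mapping step, here is a minimal Python sketch. It assumes you have already loaded the gene sets and an ID mapping table into plain dictionaries; the function name `map_gene_set_ids` and the toy data are hypothetical and are not part of Pathway Commons.

```python
from typing import Dict, List


def map_gene_set_ids(gene_sets: Dict[str, List[str]],
                     id_map: Dict[str, str]) -> Dict[str, List[str]]:
    """Translate every member ID of every gene set through id_map.

    IDs without a mapping are dropped, and duplicates created by the
    mapping are removed while keeping the original order.
    """
    mapped = {}
    for name, members in gene_sets.items():
        seen = set()
        translated = []
        for member in members:
            target = id_map.get(member)
            if target is not None and target not in seen:
                seen.add(target)
                translated.append(target)
        mapped[name] = translated
    return mapped


# Toy input (assumption): one gene set keyed by UniProt accession.
# "X00000" is a made-up ID that stands in for an unmapped entry.
uniprot_sets = {"example_pathway": ["P04637", "Q00987", "X00000"]}
uniprot_to_symbol = {"P04637": "TP53", "Q00987": "MDM2"}

symbol_sets = map_gene_set_ids(uniprot_sets, uniprot_to_symbol)
print(symbol_sets)
```

Dropping unmapped IDs silently is only one possible choice; for a real analysis you may prefer to log them so you can see how much of each set is lost.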

written 5.7 years ago by B. Arman Aksoy1.2k

Hi Arman, 

Thanks for the tip. That seems fairly straightforward to implement; I will give it a try today.


written 5.7 years ago by Lou10
5.7 years ago by
Chris Evelo10.0k
Maastricht, The Netherlands
Chris Evelo10.0k wrote:

Is this what you mean by .gmt format? 

That is essentially just a list of genes for each pathway. Since most resources allow you to download such a list of genes per pathway as a flat file, you could easily create those yourself.
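To make that concrete, here is a minimal Python sketch of writing such lists out as .gmt. It relies on the usual GSEA layout of one tab-separated line per gene set (set name, a description field, then the genes). The function name `write_gmt`, the "na" placeholder description and the toy pathways are assumptions for illustration only.

```python
import io
from typing import Dict, List, Optional, TextIO


def write_gmt(gene_sets: Dict[str, List[str]],
              handle: TextIO,
              descriptions: Optional[Dict[str, str]] = None) -> None:
    """Write one tab-separated GMT line per gene set.

    Column 1 is the set name, column 2 a free-text description
    ("na" when none is given), and the remaining columns are genes.
    """
    descriptions = descriptions or {}
    for name, genes in gene_sets.items():
        description = descriptions.get(name, "na")
        fields = [name, description] + list(genes)
        handle.write("\t".join(fields) + "\n")


# Toy gene lists (assumption), standing in for per-pathway flat files.
pathway_genes = {
    "toy_pathway_A": ["TP53", "MDM2"],
    "toy_pathway_B": ["EGFR", "KRAS", "BRAF"],
}

buffer = io.StringIO()
write_gmt(pathway_genes, buffer)
gmt_text = buffer.getvalue()
print(gmt_text, end="")
```

To write to disk instead, pass an open file, for example `with open("my_sets.gmt", "w") as fh: write_gmt(pathway_genes, fh)`. The description column is a natural place to record where each set came from.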

Update: It turns out we actually have a .gmt file for the whole WikiPathways collection. We have now made it available here:


modified 5.6 years ago • written 5.7 years ago by Chris Evelo10.0k

Hi Chris,

Thanks for your reply. Yes, that is exactly what I meant!

The main reason I am reluctant to make my own gene sets is that I am unsure what kind of additional filters I should apply to each flat file (i.e., which gene sets I should exclude based on different evidence codes or other properties). For this reason I thought it would be far simpler to just use pre-compiled gene sets that follow agreed-upon standards. However, if I can't find anything like this then I will definitely think about making my own.

modified 5.7 years ago • written 5.7 years ago by Lou10

Yes! The choice of which pathways to use really is very relevant. Online collections should make clear what they use by adding provenance data to the sets, but that is seldom done. For WikiPathways we have a "curated" collection that is tagged and can be downloaded as such from the download collection at the PathVisio website.

written 5.7 years ago by Chris Evelo10.0k

Thanks again for your help. Just to check, are you referring to the "wikipathways_Homo_sapiens_Curation-AnalysisCollection__gpml" zip file from PathVisio?

From looking at the pathway files within this folder, I can see that I could write a script that converts each file into the GSEA gmt format. However, I need to know which database each pathway was originally derived from (e.g. KEGG/BioCarta), which is not possible to tell unless you copy and paste the pathway's URL into your browser. This means I cannot quickly bin pathways from different databases into separate groups (which is something I want to do, as I would like to carry out separate enrichment analyses using gene sets/pathways from different databases).

I am rather new to bioinformatics so I am not too well acquainted with the flat file format though… 
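For anyone attempting the same conversion, here is a rough Python sketch of pulling one gene set out of a GPML file. It assumes that GPML is XML in which each `DataNode` element carries an `Xref` child with `Database` and `ID` attributes, and that the root `Pathway` element has a `Name` attribute; please check this against your own files. The function name `gpml_to_gene_set` and the sample document are made up for illustration. It does not recover which database a pathway was originally derived from.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}DataNode'."""
    return tag.rsplit("}", 1)[-1]


def gpml_to_gene_set(gpml_text: str,
                     wanted_database: str = "Entrez Gene") -> Tuple[str, List[str]]:
    """Return (pathway name, unique IDs from the wanted database)."""
    root = ET.fromstring(gpml_text)
    name = root.get("Name", "unnamed")
    ids: List[str] = []
    for node in root.iter():
        if local_name(node.tag) != "DataNode":
            continue
        for child in node:
            if (local_name(child.tag) == "Xref"
                    and child.get("Database") == wanted_database):
                gene_id = child.get("ID", "")
                if gene_id and gene_id not in ids:
                    ids.append(gene_id)
    return name, ids


# Hand-written toy document (assumption about the GPML layout).
sample_gpml = """<Pathway xmlns="http://pathvisio.org/GPML/2013a" Name="Toy pathway">
  <DataNode TextLabel="TP53" Type="GeneProduct">
    <Xref Database="Entrez Gene" ID="7157"/>
  </DataNode>
  <DataNode TextLabel="MDM2" Type="GeneProduct">
    <Xref Database="Entrez Gene" ID="4193"/>
  </DataNode>
  <DataNode TextLabel="some metabolite" Type="Metabolite">
    <Xref Database="HMDB" ID="toy-id"/>
  </DataNode>
</Pathway>"""

name, ids = gpml_to_gene_set(sample_gpml)
gmt_line = "\t".join([name, "na"] + ids)
print(gmt_line)
```

For files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`. Pathways often mix identifier systems, so filtering on a single `Database` value may drop genes that are annotated with a different one.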

modified 5.6 years ago • written 5.6 years ago by Lou10

Yes, conversion from the GPML format would be one way to do this. There are other (easier) ways to export that list, though. You could open the pathways in PathVisio, which has an export option, or you could download them directly from WikiPathways after selecting the same pathway again with the correct format selected. Alternatively, you could use the WikiPathways web service or the SPARQL endpoint (which is probably the most powerful).

Concerning your question about pathway collections: most WikiPathways pathways are actually original and were created on the wiki itself or by the original GenMAPP project that preceded it. Some were indeed converted; the important converted collections are NetPath and Reactome. These have their own portal, which you could use to select them. I don't think we actually have a lot of BioCarta pathways. KEGG is special because of the licensing problems. I would not advise you to get these from WikiPathways! We do have some pathways that are based on KEGG pathways, but these were really extended with newer information.

modified 5.6 years ago • written 5.6 years ago by Chris Evelo10.0k

Thanks again. I want to use the entire list of pathways, so I guess I will just read the files into R and edit them in a short loop. PathVisio will be useful post hoc for visualizing any significant pathways that pop up, though! Yes, I just found out about the KEGG license, so I will probably give these a miss since I doubt my institute wants to buy a license. For any others with the same question, up-to-date GO pathways in gmt format can be obtained via the GO2MSIG tool:

written 5.6 years ago by Lou10