Question: The "major" problem of gene identifier conversion
Abdullah (100), Germany, wrote 3.5 years ago:

Hi,

I know this question has been asked multiple times, but none of the answers given were satisfactory.

I managed to get a list of genes for each KEGG pathway using the kg tool (https://www.biostars.org/p/163870/). However, when I try to convert this list to another identifier type, a big problem arises.

Since I want an automated approach, and based on the suggestions in the question linked above, I decided to use the Python wrapper of MyGene.info:

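import mygene

mg = mygene.MyGeneInfo()

# Gene names as they appear in the KEGG list
allGeneSymbols = ["DP2", "DP1", "MAD3L"]

# Look the names up as official symbols and request the Entrez gene ID
out = mg.querymany(allGeneSymbols, scopes='symbol', fields='entrezgene', species='human')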

It worked for only a small set of genes, and the problem seems to be the naming. For example, one of the genes that could not be converted is called DP2 in the KEGG list. However, when I dug a bit deeper, I was able to find this gene in MyGene.info using http://mygene.info/v2/query?q=DP2, where it is named "TFDP2":

{"hits": [{"symbol": "PTGDR2", "_id": "11251", "entrezgene": 11251, "_score": 0.7157431, "name": "prostaglandin D2 receptor 2", "taxid": 9606}, {"symbol": "TFDP2", "_id": "7029", "entrezgene": 7029, "_score": 0.6262752, "name": "transcription factor Dp-2 (E2F dimerization partner 2)", "taxid": 9606}, {"symbol": "APC", "_id": "324", "entrezgene": 324, "_score": 0.58416, "name": "adenomatous polyposis coli", "taxid": 9606}], "max_score": 0.7157431, "took": 4, "total": 3}

This shows why the Python script did not find it!
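To make the mismatch concrete, here is a minimal sketch that checks the response above for an exact symbol match. The hits are copied from the post and trimmed to three fields; the helper names are made up for illustration. The last function is only a suggestion: it assumes that MyGene.info accepts `alias` as a query scope next to `symbol`, and it is not called here because it needs the `mygene` package and network access.

```python
# Hits copied from the http://mygene.info/v2/query?q=DP2 response in the post,
# trimmed to the fields used below.
HITS = [
    {"symbol": "PTGDR2", "entrezgene": 11251, "name": "prostaglandin D2 receptor 2"},
    {"symbol": "TFDP2", "entrezgene": 7029,
     "name": "transcription factor Dp-2 (E2F dimerization partner 2)"},
    {"symbol": "APC", "entrezgene": 324, "name": "adenomatous polyposis coli"},
]


def hits_matching_symbol(hits, symbol):
    """Return the hits whose current symbol equals `symbol` (case-insensitive)."""
    return [h for h in hits if h["symbol"].upper() == symbol.upper()]


def current_symbols(hits):
    """Return the current symbol of every hit, in response order."""
    return [h["symbol"] for h in hits]


def query_symbols_and_aliases(names):
    """Hypothetical helper: query by symbol OR alias (assumes 'alias' is a valid scope)."""
    import mygene  # imported lazily so the rest of the sketch runs offline

    mg = mygene.MyGeneInfo()
    return mg.querymany(names, scopes="symbol,alias",
                        fields="entrezgene,symbol", species="human")


print(hits_matching_symbol(HITS, "DP2"))  # [] : no gene currently has the symbol DP2
print(current_symbols(HITS))              # ['PTGDR2', 'TFDP2', 'APC']
```

Note that an alias search can return more than one gene per name, so the result still has to be disambiguated.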

Any suggestions on a better way to handle this problem? One option would be to get the JSON output with curl and post-process it (not the best way). Another option would be to use Reactome, but this would require rewriting everything to deal with the Reactome hierarchy, get the genes, and so on (unless some tool already exists to do this).

 

EDIT:

 

One more way I found to get all the KEGG genes (as Entrez IDs) is to download the data from GSEA (e.g., http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/5.0/c2.cp.kegg.v5.0.entrez.gmt). However, building the KEGG hierarchy from this file is simply not possible, so this does not solve my problem.
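For reference, a GMT file of this kind can be read with a few lines of Python. This is a minimal sketch assuming the usual GMT layout (one gene set per line, tab-separated: set name, description or link, then the gene IDs). The sample lines and set names are made up for illustration and are not taken from the MSigDB file.

```python
def parse_gmt(text):
    """Parse GMT text into {set_name: (description, [gene_ids])}."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, description, genes = fields[0], fields[1], fields[2:]
        gene_sets[name] = (description, [g for g in genes if g])
    return gene_sets


# Made-up sample in the assumed layout (hypothetical set names and links).
SAMPLE_GMT = (
    "KEGG_EXAMPLE_SET_A\thttp://example.org/cards/KEGG_EXAMPLE_SET_A\t7029\t11251\n"
    "KEGG_EXAMPLE_SET_B\thttp://example.org/cards/KEGG_EXAMPLE_SET_B\t324\n"
)

for name, (link, genes) in parse_gmt(SAMPLE_GMT).items():
    print(name, link, genes)
```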


Does this mean that you were not able to convert all the gene names from KEGG even using mygene.info? (I am currently investigating the same problem, but have found no satisfactory solutions.)

Reply written 3.5 years ago by Endre Bakken Stovner (880)

Exactly. I could not convert all the KEGG genes even using mygene.info. I showed an example where I know why the conversion did not work.

Reply written 3.5 years ago by Abdullah (100)
Jean-Karim Heriche (18k), EMBL Heidelberg, Germany, wrote 3.5 years ago:

There are several options to get the Entrez gene IDs for genes in a KEGG pathway:
- Use the KEGG REST API directly: e.g. http://rest.kegg.jp/get/hsa05130
In the gene section, there's one line for each gene and the first entry on the line is the Entrez gene ID.
- Use the TogoWS REST API:
http://togows.dbcls.jp/entry/pathway/hsa05200/genes
This returns a single line in which genes are separated by tabs and, within each gene, fields are separated by spaces. The first field is the Entrez gene ID.
- Use the Bioconductor KEGGREST package, which also offers a gene ID conversion function.
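As an illustration of the first option, here is a minimal Python sketch that pulls the gene lines out of a KEGG flat-file entry. It assumes the usual flat-file layout, where a section starts with its keyword (here `GENE`) in the first column and continuation lines are indented. The sample entry is made up to show the layout and is not a real pathway; `fetch_kegg_entry` uses the URL pattern from the answer and is not called here because it needs network access.

```python
from urllib.request import urlopen


def fetch_kegg_entry(pathway_id):
    """Download a flat-file entry, e.g. fetch_kegg_entry('hsa05130')."""
    with urlopen(f"http://rest.kegg.jp/get/{pathway_id}") as response:
        return response.read().decode("utf-8")


def parse_kegg_gene_section(flat_text):
    """Return {gene_id: symbol} from the GENE section of a flat-file entry."""
    genes = {}
    in_gene_section = False
    for line in flat_text.splitlines():
        if line.startswith("GENE"):
            in_gene_section = True
            body = line[len("GENE"):]
        elif in_gene_section and line.startswith(" "):
            body = line  # indented continuation line of the GENE section
        else:
            in_gene_section = False  # any other keyword ends the section
            continue
        fields = body.split(None, 1)
        if not fields:
            continue
        gene_id = fields[0]  # first entry on the line: the Entrez gene ID
        rest = fields[1] if len(fields) > 1 else ""
        genes[gene_id] = rest.split(";", 1)[0].strip()
    return genes


# Made-up entry that only illustrates the assumed layout.
SAMPLE_ENTRY = """\
ENTRY       hsa00000                    Pathway
NAME        Illustrative pathway
GENE        7029  TFDP2; transcription factor Dp-2
            11251  PTGDR2; prostaglandin D2 receptor 2
COMPOUND    C00001  H2O
///
"""

print(parse_kegg_gene_section(SAMPLE_ENTRY))
```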
 


Are the KEGG identifiers that look like "hsa:401105" just Entrez gene IDs (with a species prefix), then?

Reply written 3.5 years ago by Endre Bakken Stovner (880)

Yes. In hsa:401105, 401105 is the Entrez gene ID.
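In code, splitting such an identifier is a one-liner. A small sketch follows; the function name is made up, and it relies only on what is stated above for human (`hsa`) identifiers. Other organisms in KEGG may use different gene ID types, so do not assume the part after the colon is always an Entrez gene ID.

```python
def split_kegg_gene_id(kegg_gene_id):
    """Split 'hsa:401105' into ('hsa', '401105')."""
    organism, sep, gene_id = kegg_gene_id.partition(":")
    if not sep or not organism or not gene_id:
        raise ValueError(f"not a prefixed KEGG gene ID: {kegg_gene_id!r}")
    return organism, gene_id


print(split_kegg_gene_id("hsa:401105"))  # ('hsa', '401105')
```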

Reply written 3.5 years ago by Jean-Karim Heriche (18k)

Thanks for your patient help.

Reply written 3.5 years ago (modified 3.5 years ago) by Endre Bakken Stovner (880)

We will be waiting for an update on the kg tool :)

Reply written 3.5 years ago by Abdullah (100)
Fidel (1.9k), Germany, wrote 3.5 years ago:

Have you considered using another pathway database? I find Reactome to be better maintained and more informative in general (see https://www.biostars.org/p/3432/). You can easily browse pathways and get UniProt protein identifiers, which can be easily converted to other identifiers. They also have a nice tool for converting identifiers and finding pathway enrichments.

 

 


Well, the Reactome data does not make much sense to me. For example, if you look at this file (http://www.reactome.org/download/current/Ensembl2Reactome_All_Levels.txt), which should "supposedly" contain the mapping between Ensembl IDs and pathways on all levels, you find only 1423 entries for Homo sapiens and only 321 unique Ensembl IDs (which is too little data, unless I'm getting something completely wrong).

 

After digging a bit more into Reactome, I was able to find this non-public file (http://www.reactome.org/download/current/homo_sapiens_ensembl_gene_to_pathways.csv), which also "supposedly" contains the mapping between human Ensembl IDs and pathways. This file contains 7126 unique Ensembl IDs, which makes more sense. If this is correct, one needs to rebuild the hierarchy of Reactome pathways using http://www.reactome.org/download/current/ReactomePathwaysRelation.txt.

So here are two contradictory outputs. I assume the second one is correct, but who knows.
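If the second file is the one to use, rebuilding the hierarchy from the relation file could look roughly like this. The sketch assumes that ReactomePathwaysRelation.txt has two tab-separated columns, parent pathway then child pathway; check this against the actual file. The pathway IDs in the sample are placeholders, not real Reactome identifiers.

```python
from collections import defaultdict


def parse_relations(text):
    """Build {child: {parents}} from 'parent<TAB>child' lines (assumed column order)."""
    parents = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue
        parent, child = line.split("\t")[:2]
        parents[child].add(parent)
    return parents


def ancestors(pathway_id, parents):
    """Return every pathway above `pathway_id` in the hierarchy."""
    seen = set()
    stack = [pathway_id]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Placeholder IDs, only to show the shape of the data.
SAMPLE_RELATIONS = "TOP\tMID\nMID\tLEAF\nTOP\tOTHER\n"

relations = parse_relations(SAMPLE_RELATIONS)
print(sorted(ancestors("LEAF", relations)))  # ['MID', 'TOP']
```

With this lookup, each gene-to-pathway row can be expanded to all enclosing pathway levels.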

Reply written 3.5 years ago by Abdullah (100)

I looked at the UniProt2Reactome.txt file, which contains 8723 unique UniProt identifiers. The Ensembl2Reactome file seems truncated, or maybe they don't have many mappings to Ensembl. In any case, UniProt should be the primary and more reliable identifier, which you can map to other identifiers, for example using bioDBnet (http://biodbnet.abcc.ncifcrf.gov/).

You may want to contact Reactome directly; they are very helpful (help@reactome.org).

Reply written 3.5 years ago (modified 3.5 years ago) by Fidel (1.9k)
Jean-Karim Heriche (18k), EMBL Heidelberg, Germany, wrote 3.5 years ago:

Use database gene identifiers; they are more stable than gene names or gene symbols. Names and symbols change over time and, unlike changes to database IDs, these changes are not tracked.


Can you please expand upon what a database gene identifier is and how the asker can use one to solve his or her problem?

Is it just the field called _id above? Still, that won't help the conversion as far as I can see.

Reply written 3.5 years ago (modified 3.5 years ago) by Endre Bakken Stovner (880)

I mean IDs from public databases like Ensembl (e.g. ENSG00000178999) or NCBI's Entrez Gene (e.g. 11251). Since KEGG references genes using Entrez gene IDs, one should retrieve these IDs from KEGG (along with the symbols/names if needed) and use them for conversion.

In the example above, querying with DP2 returns two entries because this name was used for two genes, which are now named TFDP2 and PTGDR2, so the only way to disambiguate is to use a database ID or accession number.
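A toy illustration of the difference, using only the two genes discussed above. The Entrez IDs and symbols come from the query output in the question; the alias lists are an assumption that simply encodes the statement that both genes used to be called DP2.

```python
from collections import defaultdict

# Two records from the thread; 'aliases' is an assumed field for illustration.
GENES = [
    {"entrezgene": 7029, "symbol": "TFDP2", "aliases": ["DP2"]},
    {"entrezgene": 11251, "symbol": "PTGDR2", "aliases": ["DP2"]},
]

# Lookup by name (symbol or alias): one name may point to several genes.
by_name = defaultdict(set)
for gene in GENES:
    for name in [gene["symbol"], *gene["aliases"]]:
        by_name[name].add(gene["entrezgene"])

# Lookup by database ID: each key identifies exactly one gene.
by_id = {gene["entrezgene"]: gene["symbol"] for gene in GENES}

print(sorted(by_name["DP2"]))  # [7029, 11251] : ambiguous
print(by_id[7029])             # TFDP2 : unambiguous
```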

Reply written 3.5 years ago (modified 3.5 years ago) by Jean-Karim Heriche (18k)
Abdullah (100), Germany, wrote 3.5 years ago:

I came up with a weird solution to my problem. Not sure if this would help others.

What I will do is the following:

1- Download the GSEA KEGG lists (Entrez IDs) from here: http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/5.0/c2.cp.kegg.v5.0.entrez.gmt . However, those have only pathway names, which makes it hard to build the hierarchy.

2- To fix this, I will curl the URL that is found inside this file for each pathway, e.g.,

curl http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_O_GLYCAN_BIOSYNTHESIS | grep -A1 'External links'

and extract the "External links" field, which contains the KEGG ID of this pathway. Using this, I can have all the KEGG pathway IDs along with their corresponding genes.

3- Using the pathway IDs, I can build the hierarchy using this BRITE hierarchy file: http://www.genome.jp/kegg-bin/download_htext?htext=br08901.keg&format=htext&filedir=
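Step 3 could be sketched as follows. This assumes the htext layout in which each line starts with a level letter (A for top category, B for subcategory, C for pathway map), with the C lines holding the map number followed by the pathway name; any HTML tags on category lines are stripped. The sample text is a hand-written excerpt in that assumed layout, not a copy of the downloaded file.

```python
import re


def parse_htext(text):
    """Map each pathway number to (top category, subcategory, pathway name)."""
    hierarchy = {}
    top = sub = None
    for line in text.splitlines():
        if not line or line[0] not in "ABC":
            continue  # header, separator, or blank line
        level = line[0]
        label = re.sub(r"<[^>]+>", "", line[1:]).strip()  # drop tags like <b>
        if not label:
            continue
        if level == "A":
            top, sub = label, None
        elif level == "B":
            sub = label
        else:
            number, _, name = label.partition(" ")
            hierarchy[number] = (top, sub, name.strip())
    return hierarchy


# Hand-written excerpt in the assumed layout.
SAMPLE_HTEXT = """
!
A<b>Metabolism</b>
B  Carbohydrate metabolism
C    00010  Glycolysis / Gluconeogenesis
C    00020  Citrate cycle (TCA cycle)
B  Energy metabolism
C    00190  Oxidative phosphorylation
!
"""

for number, levels in parse_htext(SAMPLE_HTEXT).items():
    print(number, levels)
```

The map numbers can then be joined to the pathway IDs collected in step 2 (for human pathways, the number with an `hsa` prefix).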

 

 
