Question: Which Ensembl protein id (ENSP) should I use? (Which id is used by string-db)
5.7 years ago by
Iran, Islamic Republic Of
Soheil Jahangiri10 wrote:


I have a list of gene names (symbols) and I want to convert them into Ensembl protein IDs (ENSP).

There are many tools for this, such as BioMart, DAVID and bioDBnet. However, all of them return multiple Ensembl protein IDs for a single gene, probably because of alternative transcripts or splicing. If I want to use just one of these IDs, which one should I use?

In fact, I need these IDs to extract PPI networks from the string-db database files.

string-db uses Ensembl protein IDs in its database files, and I don't know which Ensembl ID it uses for each gene.

Does anyone have any idea?

modified 3.4 years ago by mahmoud.s.fahmy0 • written 5.7 years ago by Soheil Jahangiri10
5.7 years ago by
Uma A220 wrote:

I am assuming that you want to create a network from STRING data. string-db provides protein-protein interaction data, so its files describe the network at the transcript/protein level (ENSPs), which you cannot use directly to build a network from gene symbols. You do not need to go to any site other than string-db to obtain the ENSP-to-gene-symbol mappings. Here's what you should do:

  1. On the STRING download page, select the organism for which you want to download the data, then look at the "General flatfiles & full database dumps" section. The list includes a "protein aliases" file link; download that file.
  2. The file contains the species id, protein id (ENSP), alias (the gene symbol is found in this column) and source. It holds many other identifier mappings as well, so filter on the source column, using sources such as BLAST_UniProt_GN, Ensembl_UniProt_GN or any other source you want to include, to keep only the lines that map an ENSP to a gene symbol. Note that each gene symbol can have multiple transcripts (ENSPs), and hence multiple rows.
  3. Once you get the curated mapping list, use the string "protein links" file to obtain the network interaction data and simply replace the protein identifiers in that file with their mapped gene symbols. Now you have the string data in terms of gene symbols.
  4. Create a network using your list of gene symbols.
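
The steps above can be sketched in base R roughly as follows. This is only an illustration on toy data: the two inline tables stand in for the "protein aliases" and "protein links" files, and their column layout (four tab-separated alias columns, and links identifiers of the form `<species>.<ENSP>`) is an assumption that should be checked against the files you actually download. The names `gene_sources`, `symbol_of` and `my_genes` are made up for the example.

```r
# Toy stand-ins for the two STRING flat files. The column layout is an
# assumption based on the description above; verify it on the real files.
aliases_txt <- "species\tprotein_id\talias\tsource
9606\tENSP00000000233\tARF5\tEnsembl_UniProt_GN
9606\tENSP00000000233\tP84085\tEnsembl_UniProt_AC
9606\tENSP00000000412\tM6PR\tBLAST_UniProt_GN
9606\tENSP00000000412\tM6PR\tEnsembl_UniProt_GN"

links_txt <- "protein1 protein2 combined_score
9606.ENSP00000000233 9606.ENSP00000000412 180
9606.ENSP00000000412 9606.ENSP00000000233 180"

aliases <- read.delim(text = aliases_txt, stringsAsFactors = FALSE)
links <- read.table(text = links_txt, header = TRUE, stringsAsFactors = FALSE)

# Step 2: keep only rows whose source marks the alias as a gene name.
gene_sources <- c("BLAST_UniProt_GN", "Ensembl_UniProt_GN")
mapping <- unique(aliases[aliases$source %in% gene_sources,
                          c("species", "protein_id", "alias")])

# Lookup from "<species>.<ENSP>" (the form assumed for the links file) to
# gene symbol. If one ENSP has several symbols, only the first is kept.
keys <- paste(mapping$species, mapping$protein_id, sep = ".")
first <- !duplicated(keys)
symbol_of <- mapping$alias[first]
names(symbol_of) <- keys[first]

# Step 3: replace the protein identifiers with their gene symbols.
network <- data.frame(
  gene1 = unname(symbol_of[links$protein1]),
  gene2 = unname(symbol_of[links$protein2]),
  combined_score = links$combined_score,
  stringsAsFactors = FALSE
)

# Step 4: restrict the network to the gene symbols of interest.
my_genes <- c("ARF5", "M6PR")
network <- network[network$gene1 %in% my_genes & network$gene2 %in% my_genes, ]
print(network)
```

Because several ENSPs can map to the same symbol, the resulting table may contain repeated gene pairs; whether to collapse them (for example by keeping the highest score) depends on the analysis.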

Note: It seems that the string-db files are being updated to v10 currently. If the file is not available right now, do check after some time to get the updated data.

modified 5.7 years ago • written 5.7 years ago by Uma A220
3.5 years ago by
Nitro_Shade20 wrote:


Not sure if it's still relevant or not, but I made a very small utility to do this extraction. It can be found here.

3.4 years ago by
Abhik30 wrote:

I found that the protein aliases file does not contain gene symbols. One way to convert an ENSP to a HUGO gene symbol is to use the script below:

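library(biomaRt)
# The host value is blank in the original post, so it is left out here and biomaRt's default host is used.
mart <- useMart(biomart = 'ENSEMBL_MART_ENSEMBL', dataset = 'hsapiens_gene_ensembl')
mart <- useDataset("hsapiens_gene_ensembl", mart = mart)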

ensembl_genes <- "ENSP00000000233"

gene_names <- getBM(
    filters= "ensembl_peptide_id", 
    attributes= c("ensembl_peptide_id","hgnc_symbol","description"),
    values= ensembl_genes,
    mart= mart)

ensembl_peptide_id hgnc_symbol                                            description
> ENSP00000000233        ARF5 ADP-ribosylation factor 5 [Source:HGNC Symbol;Acc:658]

Hope this helps.

3.4 years ago by
mahmoud.s.fahmy0 wrote:

The previous answers would work. An easier way is to call the get_aliases method of the STRINGdb package directly.
