Converting Protein Names To Gene Ids
9.1 years ago
kanwarjag ★ 1.1k

I have proteomic data with protein names, and I want to overlay it with RNA-seq data based on gene IDs (Entrez Gene ID, Ensembl ID, etc.). I tried BioMart, DAVID, and ID Converter, but all of them return very few hits or none. Is there a good tool for converting protein names to gene IDs?

Thanks

Which organism? And can you post an example of a "protein name".

It is mouse. A few of the protein names are: RAB7A, TBB2A, KPYR, NDKA.

It is strange not to get results with BioMart. Be careful: for Mus musculus, gene symbols are usually written with only the first letter capitalized (Tnf instead of TNF), and the search may be case-sensitive. By the way, I tried your 4 protein names and found no match among MGI symbols; searching by UniProt gene name found only 1. If this is proteomic data, maybe you have the UniProt accession IDs (something like "Q8K2Q7")? It might be easier to use those to get Ensembl IDs via BioMart. Julien

No, this is exactly what the proteomics core gave me. One mistake I made: the names are actually of the form CPSM_MOUSE HBB1_MOUSE ALBU_MOUSE GSTM1_MOUSE FTHFD_MOUSE

If it makes any difference, it says they are from the Sprot_54.0 database.

Those look like they are UniProtKB entry names:

The other ones you mention are missing the species suffix, but can also be found:

Unfortunately, UniProtKB entry names are unstable, so while you may be able to find most of them in UniProtKB without any problems, some will have changed and will be a little harder to find (e.g. FTHFD_MOUSE). This is why UniProt recommends the use of accession numbers over the more human-friendly entry names.

Luckily, you know which version of UniProtKB/Swiss-Prot these came from (Sprot_54.0 => UniProtKB/Swiss-Prot 54.0 (2007-07-24)). This means you can use the UniProtKB Sequence/Annotation Version Archive (UniSave) to resolve the entry names to the specific UniProtKB entries that were used to generate the annotations, and from these entries get the UniProtKB accessions and the associated gene names. Given those, mapping to Entrez Gene and/or Ensembl is simple and can be done using direct queries in those resources, or with mapping services such as the UniProt Database identifier mapping service.
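As a rough illustration of the last step, here is a minimal Python sketch that appends the missing species suffix to names such as RAB7A and then submits the entry names to the UniProt ID mapping service. The endpoint paths (`/run`, `/status`, `/stream`), the database names `UniProtKB_AC-ID` and `GeneID`, and the shape of the JSON results are assumptions about the current UniProt REST API and should be checked against its documentation. All function names are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: base URL of the UniProt REST ID mapping service.
IDMAPPING_API = "https://rest.uniprot.org/idmapping"


def to_entry_name(name, species="MOUSE"):
    """Upper-case a name and append the species suffix if it has none."""
    name = name.strip().upper()
    return name if "_" in name else f"{name}_{species}"


def build_job_payload(ids, source="UniProtKB_AC-ID", target="GeneID"):
    """URL-encode the form fields of a mapping job (database names are assumptions)."""
    fields = {"from": source, "to": target, "ids": ",".join(ids)}
    return urllib.parse.urlencode(fields).encode("utf-8")


def parse_results(payload):
    """Collect {"from": ..., "to": ...} rows into a dict of lists."""
    mapping = {}
    for row in payload.get("results", []):
        mapping.setdefault(row["from"], []).append(row["to"])
    return mapping


def map_ids(ids, target="GeneID", poll_seconds=3, max_polls=20):
    """Submit a job, poll until it is no longer queued or running, return the mapping.

    Needs network access; endpoint paths are assumptions (see above).
    """
    payload = build_job_payload(ids, target=target)
    with urllib.request.urlopen(f"{IDMAPPING_API}/run", data=payload) as response:
        job_id = json.load(response)["jobId"]
    for _ in range(max_polls):
        with urllib.request.urlopen(f"{IDMAPPING_API}/status/{job_id}") as response:
            status = json.load(response)
        if status.get("jobStatus") not in ("NEW", "RUNNING"):
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"ID mapping job {job_id} did not finish in time")
    with urllib.request.urlopen(f"{IDMAPPING_API}/stream/{job_id}") as response:
        return parse_results(json.load(response))
```

Something like `map_ids([to_entry_name(n) for n in ["RAB7A", "KPYR"]])` would then return a dict from entry name to Entrez Gene IDs, and passing `target="Ensembl"` should give Ensembl IDs instead. Entry names that have changed since release 54.0 (e.g. FTHFD_MOUSE) will probably not map this way and would still need to be resolved through UniSave first.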

9.1 years ago
qiyunzhu ▴ 430

I think you can write a script that connects to the NCBI server and looks up the information you need. Here's the tutorial: http://www.ncbi.nlm.nih.gov/books/NBK25500/

In my case I did something like this (Perl):

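use strict;
use warnings;
use LWP::Simple;

my $base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";

# Search the protein database; replace protein_name with your query term
my $result = get("$base/esearch.fcgi?db=protein&term=protein_name");
die "esearch request failed\n" unless defined $result;

# Collect every <Id> element (a protein GI) from the XML response
my @GIs;
push @GIs, $1 while $result =~ s/<Id>(\d+)<\/Id>//s;

# Fetch the summary record of each protein
foreach my $gi (@GIs) {
    my $summary = get("$base/esummary.fcgi?db=protein&id=$gi");
    # some code to handle result... #
}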


This code shows how you can look up a protein name on the NCBI server, get a list of protein GIs, and then get the details of each protein. In the GenBank-format record of each protein, go to the "CDS" section and you will see the coordinates of the corresponding nucleotides; from those you can tell whether your RNA-seq reads overlap the coding region.

Note that a descriptive protein name does not necessarily link to one unique protein. Maybe you need GIs or accession numbers instead of names.
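If you already have gene symbols rather than descriptive protein names, the same E-utilities lookup can be pointed at the gene database to get Entrez Gene IDs directly. Below is a minimal Python sketch of that idea; the function names are hypothetical, and it assumes the search fields `[Gene Name]` and `[Organism]` and an ESearch XML response with an `IdList` of `Id` elements.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(symbol, organism="Mus musculus", db="gene"):
    """Build an ESearch URL restricted to one gene symbol and one organism."""
    term = f"{symbol}[Gene Name] AND {organism}[Organism]"
    query = urllib.parse.urlencode({"db": db, "term": term})
    return f"{EUTILS}/esearch.fcgi?{query}"


def parse_id_list(xml_text):
    """Return the text of every <Id> element under <IdList>."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


def search_gene_ids(symbol, organism="Mus musculus"):
    """Run the search (needs network access) and return the matching IDs."""
    with urllib.request.urlopen(build_esearch_url(symbol, organism)) as response:
        return parse_id_list(response.read().decode("utf-8"))
```

Restricting by organism should cut down on the ambiguity mentioned above, but a symbol can still return more than one ID, so the result is kept as a list rather than a single value.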

Original questioner said: "Thanks. Is there a good way to convert protein names to any protein IDs/GIs? I used to use an R script for converting gene names to Entrez IDs/RefSeq IDs etc., but I have no clue about protein names. Thanks."

Hello, I think the code I posted does convert protein names to protein GIs. However, please let me know what you mean by "protein name". Is it something like "BRCA1 [Homo sapiens]"?

I posted the info above.

5.0 years ago
Zhilong Jia ★ 2.0k

UniProt has a Retrieve/ID mapping service, which can return the gene symbol for a protein name. For R programming, the Bioconductor package UniProt.ws can be used to do the same.
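Outside R, the gene symbols can probably also be pulled from the UniProtKB search endpoint. Here is a minimal Python sketch; the endpoint URL, the `id:` query field for entry names, the `fields` values, and the three-column TSV layout (accession, entry name, gene names) are assumptions about the current UniProt REST API, and the function names are hypothetical.

```python
import csv
import io
import urllib.parse
import urllib.request

# Assumption: UniProtKB search endpoint of the UniProt REST API.
SEARCH_API = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(entry_names):
    """Build a search URL asking for accession, entry name and gene names as TSV."""
    query = " OR ".join(f"id:{name}" for name in entry_names)
    params = {"query": query, "fields": "accession,id,gene_names", "format": "tsv"}
    return f"{SEARCH_API}?{urllib.parse.urlencode(params)}"


def parse_tsv(tsv_text):
    """Map each entry name to its accession and list of gene names."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header row
    table = {}
    for row in reader:
        if len(row) >= 3:
            accession, entry_name, gene_names = row[:3]
            table[entry_name] = {"accession": accession, "genes": gene_names.split()}
    return table


def fetch_gene_names(entry_names):
    """Query UniProtKB (needs network access) and return the parsed table."""
    with urllib.request.urlopen(build_search_url(entry_names)) as response:
        return parse_tsv(response.read().decode("utf-8"))
```

A call such as `fetch_gene_names(["CPSM_MOUSE", "ALBU_MOUSE"])` would then give, for each entry name, the accession plus the gene symbols to match against the RNA-seq table.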

If you need to convert between mouse/rat and human gene symbols, HCOP will be useful: it provides bulk downloads, which can be used to do the mapping programmatically.