Question

How to tell if a RefSeq or UniProt protein identifier is human?

10.3 years ago

pwg46 ▴ 540

Hello,

I am creating a local table to convert RefSeq protein identifiers (e.g., NP_ accessions) to UniProt identifiers. RefSeq has a file named "gene_refseq_uniprotkb_collab.gz" on their FTP server, which contains mappings from RefSeq proteins to UniProt. However, it holds 18 million mappings, and I want to build a table using only human identifiers. Is there an easy way to distinguish a human RefSeq (or UniProt) identifier from one that belongs to another species?

Thanks.

refseq uniprot human protein • 3.4k views

updated 3.1 years ago by Ram 45k • written 10.3 years ago by pwg46 ▴ 540

Ram · Answer 1 · 2015-04-04

10.3 years ago

Sean Davis 27k

No, I do not think that is possible using only the IDs themselves. However, you can use other resources on the NCBI FTP server to map RefSeq accessions to Gene IDs and then to taxon IDs.

Practically speaking, though, you may be able to use the 18 million mappings directly and not worry about pre-filtering.
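
The RefSeq-to-taxon route mentioned above could be sketched roughly as follows. This assumes NCBI's gene2refseq.gz (in the gene/DATA directory of the FTP server) is tab-delimited, with the taxon ID in column 1 and the versioned protein accession in column 6, and that the first column of gene_refseq_uniprotkb_collab.gz is the RefSeq protein accession; check the header line of your copies before relying on it. The function names are made up for illustration, and 9606 is the NCBI taxon ID for human.

```shell
# Sketch only: column positions are assumptions about the NCBI file layouts.

# Read gene2refseq-style rows on stdin and print the unique, version-less
# protein accessions for one taxon (default 9606, human).
human_refseq_proteins() {
  awk -F'\t' -v taxid="${1:-9606}" '
    $1 == taxid && $6 != "-" { sub(/\.[0-9]+$/, "", $6); print $6 }
  ' | sort -u
}

# Read mapping rows on stdin and keep those whose first column (version
# stripped) appears in the accession list file given as the first argument.
filter_mapping() {
  awk -F'\t' '
    NR == FNR { keep[$1]; next }
    { acc = $1; sub(/\.[0-9]+$/, "", acc) }
    acc in keep
  ' "$1" -
}

# Possible usage (file names as downloaded from the NCBI FTP server):
#   zcat gene2refseq.gz | human_refseq_proteins 9606 > human_np.txt
#   zcat gene_refseq_uniprotkb_collab.gz | filter_mapping human_np.txt > human_refseq_uniprot.tsv
```

Stripping the version suffix on both sides avoids missing matches when one file carries NP_xxxxxx.N and the other only NP_xxxxxx.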

updated 3.1 years ago by Ram 45k • written 10.3 years ago by Sean Davis 27k

Ram · Answer 2 · 2015-04-06

There is a way to do this using the REST-style KEGG API.

You can convert an NCBI Gene ID or UniProt ID to a KEGG gene ID using a query such as the following (paste these into your browser):

The full chain is NCBI Gene ID/UniProt ID --> KEGG gene ID --> KEGG genome ID --> KEGG genome name. I use a one-liner such as the following to generate a four-column, tab-delimited table (1: NCBI Gene ID/UniProt ID, 2: KEGG gene ID, 3: KEGG genome ID, 4: KEGG genome description). You can then grep -i that table for either "genome:T01001" or "Homo sapiens" and print the first column, which leaves only the entries from your sample that are human. For example, here I am picking UniProt IDs from a GenBank file, test.gbk:

# Extract unique UniProt accessions, then chain three KEGG REST lookups:
# conv (UniProt -> KEGG gene), link (gene -> genome), find (genome -> description).
grep -Po "(?<=UniProtKB:)\w*" test.gbk | sort -u |
  while read -r l; do curl -s "http://rest.kegg.jp/conv/genes/uniprot:$l"; done |
  while IFS=$'\t' read -r -a q; do echo -e "${q[0]}\t$(curl -s "http://rest.kegg.jp/link/genome/${q[1]}")"; done |
  while IFS=$'\t' read -r -a p; do echo -e "${p[0]}\t${p[1]}\t$(curl -s "http://rest.kegg.jp/find/genome/${p[2]}")"; done
up:A0QTF8    msm:MSMEG_1825    genome:T00434    msm, MYCS2, 246196; Mycobacterium smegmatis MC2 155
up:A0QTG1    msm:MSMEG_1829    genome:T00434    msm, MYCS2, 246196; Mycobacterium smegmatis MC2 155
up:A0QU63    msm:MSMEG_2091    genome:T00434    msm, MYCS2, 246196; Mycobacterium smegmatis MC2 155
up:A0QWV9    msm:MSMEG_3081    genome:T00434    msm, MYCS2, 246196; Mycobacterium smegmatis MC2 155
up:A0QYU6    msm:MSMEG_3791    genome:T00434    msm, MYCS2, 246196; Mycobacterium smegmatis MC2 155
up:A0R2D5    msm:MSMEG_5073    genome:T00434    msm, MYCS2, 246196; Mycobacterium smegmatis MC2 155
up:A0R3I8    msm:MSMEG_5488    genome:T00434    msm, MYCS2, 246196; Mycobacterium smegmatis MC2 155
up:A0R4Z6    msm:MSMEG_6009    genome:T00434    msm, MYCS2, 246196; Mycobacterium smegmatis MC2 155
up:A2REG0    spf:SpyM50907    genome:T00497    spf, STRPG, 160491; Streptococcus pyogenes Manfredo (serotype M5)
up:A2RI45    llm:llmg_0332    genome:T00475    llm, LACLM, 416870; Lactococcus lactis subsp. cremoris MG1363
up:A2RIQ0    llm:llmg_0542    genome:T00475    llm, LACLM, 416870; Lactococcus lactis subsp. cremoris MG1363
up:A2RM05    llm:llmg_1760    genome:T00475    llm, LACLM, 416870; Lactococcus lactis subsp. cremoris MG1363
up:A3DDQ3    cth:Cthe_0847    genome:T00474    cth, CLOTH, 203119; Ruminiclostridium thermocellum ATCC 27405 (Clostridium thermocellum ATCC 27405)
up:A3DHB8    cth:Cthe_2143    genome:T00474    cth, CLOTH, 203119; Ruminiclostridium thermocellum ATCC 27405 (Clostridium thermocellum ATCC 27405)
up:A6QGP7    sae:NWMN_1257    genome:T00557    sae, STAAE, 426430; Staphylococcus aureus subsp. aureus Newman
up:A6QKF4    sae:NWMN_2564    genome:T00557    sae, STAAE, 426430; Staphylococcus aureus subsp. aureus Newman
up:A6TRX5    amt:Amet_2793    genome:T00551    amt, ALKMQ, 293826; Alkaliphilus metalliredigens QYMF
up:A7MVC2    vca:M892_16180    genome:T02837    vca, 338187; Vibrio campbellii ATCC BAA-1116
up:A7MVC2    vha:VIBHAR_02959    genome:T00589    vha, VIBHB, 338187; Vibrio campbellii ATCC BAA-1116 (Vibrio harveyi ATCC BAA-1116)


up:C0H3V2    bsu:BSU03982    genome:T00010    bsu, BACSU, 224308; Bacillus subtilis subsp. subtilis 168
up:C0QYX7    bhy:BHWA1_00569    genome:T00865    bhy, TREHY, 565034; Brachyspira hyodysenteriae WA1
up:C0SP91    bsu:BSU40370    genome:T00010    bsu, BACSU, 224308; Bacillus subtilis subsp. subtilis 168
up:C0SP99    bsu:BSU03350    genome:T00010    bsu, BACSU, 224308; Bacillus subtilis subsp. subtilis 168
up:C1CMI6    spp:SPP_1876    genome:T00874    spp, STRZP, 488223; Streptococcus pneumoniae P1031
up:C3L5T6    bah:BAMEG_4595    genome:T00886    bah, BACAC, 568206; Bacillus anthracis CDC 684

up:D3DFG8    hte:Hydth_0104    genome:T02106    hte, 608538; Hydrogenobacter thermophilus TK-6
up:D3DFG8    hth:HTH_0103    genome:T01167    hth, 608538; Hydrogenobacter thermophilus TK-6
up:D3DJ42    hte:Hydth_1383    genome:T02106    hte, 608538; Hydrogenobacter thermophilus TK-6
up:D3DJ42    hth:HTH_1393    genome:T01167    hth, 608538; Hydrogenobacter thermophilus TK-6
up:D3DKC4    hte:Hydth_1815    genome:T02106    hte, 608538; Hydrogenobacter thermophilus TK-6
up:D3DKC4    hth:HTH_1832    genome:T01167    hth, 608538; Hydrogenobacter thermophilus TK-6

Mind you, this is going to be a slow process, since every identifier costs three separate HTTP requests!
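
The final filtering step (keeping only human rows of the four-column table) might look like the sketch below. It assumes the table layout described above, with the KEGG genome ID in column 3 and the genome description in column 4; the function name is hypothetical.

```shell
# Sketch only: read the 4-column tab-delimited table on stdin and print the
# unique first-column IDs whose genome is human, matched either by the KEGG
# genome ID (genome:T01001) or by "Homo sapiens" in the description.
human_ids_from_kegg_table() {
  awk -F'\t' '
    $3 == "genome:T01001" || tolower($4) ~ /homo sapiens/ { print $1 }
  ' | sort -u
}

# Possible usage, with the one-liner's output saved as kegg_table.tsv:
#   human_ids_from_kegg_table < kegg_table.tsv > human_ids.txt
```

Matching on the column rather than grepping the whole line avoids false hits where "Homo sapiens" happens to appear in some other field.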

Best Wishes,
Umer

score 0 · Answer 3 · 2015-04-07

UniProt provides species-specific ID mapping downloads (e.g. for HUMAN) in the by_organism directory of its FTP mirrors:

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/

ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/

ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/
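
As a rough sketch of how such a download could be turned into the RefSeq-to-UniProt table asked about: this assumes the human file is named HUMAN_9606_idmapping.dat.gz and has three tab-delimited columns (UniProtKB accession, ID type, external ID), with RefSeq protein accessions stored under the ID type "RefSeq". Verify both against the README in that directory; the function name is made up for illustration.

```shell
# Sketch only: read idmapping.dat-style rows on stdin and print
# "RefSeq accession <TAB> UniProt accession" for every RefSeq row.
refseq_to_uniprot() {
  awk -F'\t' -v OFS='\t' '$2 == "RefSeq" { print $3, $1 }'
}

# Possible usage:
#   zcat HUMAN_9606_idmapping.dat.gz | refseq_to_uniprot > human_refseq_uniprot.tsv
```

Because the file is already restricted to one organism, no taxon filtering is needed here.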