How to get the name of a protein by knowing only the RefSeq accession number?
2.8 years ago
Alex ▴ 10

Hi everyone! I have a fasta file with amino acid sequences which have only the RefSeq accession number (e.g. WP_ + 9 digits) and I'm trying to get the name of the proteins so that I can add them to the fasta ids. Here is an example:

>WP_051684486.1
MSIFGEQFLARRNRDQIDLDNALQDVYEAVTGRESIRYSINSDEQVRKELERICFYLGVR
ADQDVPEYNDLEDMLDYITRPFAIMRRHILLTHHWWKNGDGPLLVSKKDSDELLALIPGR
LGGYYYTDFRSNKKIKLDRHNAGEFEKEAICFYKPLPLSSLSANELTGLLFKNMAAADLA
MLVLSGIGIVGVSLLIPFATKMVFEYVIPTGAMTLVGSFSFLLISSAMVAYIIAVIKQGY
ADRVKVRMEVYLTHGVMGRMINFPTSFFASKSTGELYRVFDNLREIPQILIDSVIVPIID
ISLAMLFIIQIAVIVPELLVPAVITVLLQFVCMAIGTFQAYGLLNIELQQDRKIQGLAIS
VYEGIQRIKLSGSESRIMAKWAGLYSKKAKVAYPAVFPVRFQTEMIAFISMMGMLAAFYK
GFTDNISISQFVAFVAAFGMLTGSITAFSNKSKDVIKLKPVLKMSDEILKECPEVSKEKL
IVDHLSGKIEVKDLTFRYGRDLPLILDGVSFTVHPGEYVAIVGKSGCGKSTLVRIFMGFE
KAVSGSVSYDDIDVERIDPRSLRRSIGVVMQSGNLFYDSIYRNIAISAPGLSMEEAWEAA
EKAGIAEDIRNMPMKMKTLIPQGGGGISGGQRQRIMIARALAAKPNILIFDEATSALDNI
TQKVVQDSLDQLNCTRIVIAHRLSTIQNCDRILVLDKGRIIEEGNYQELLKKGGFFANLI
KRQQL

On the NCBI RefSeq site, this maps to "ATP-binding cassette domain-containing protein", so I want to add that to the identifier in order to get:

>WP_051684486.1|ATP-binding cassette domain-containing protein
MSIFGEQFLARRNRDQIDLDNALQDVYEAVTGRESIRYSINSDEQVRKELERICFYLGVR
ADQDVPEYNDLEDMLDYITRPFAIMRRHILLTHHWWKNGDGPLLVSKKDSDELLALIPGR
LGGYYYTDFRSNKKIKLDRHNAGEFEKEAICFYKPLPLSSLSANELTGLLFKNMAAADLA
MLVLSGIGIVGVSLLIPFATKMVFEYVIPTGAMTLVGSFSFLLISSAMVAYIIAVIKQGY
ADRVKVRMEVYLTHGVMGRMINFPTSFFASKSTGELYRVFDNLREIPQILIDSVIVPIID
ISLAMLFIIQIAVIVPELLVPAVITVLLQFVCMAIGTFQAYGLLNIELQQDRKIQGLAIS
VYEGIQRIKLSGSESRIMAKWAGLYSKKAKVAYPAVFPVRFQTEMIAFISMMGMLAAFYK
GFTDNISISQFVAFVAAFGMLTGSITAFSNKSKDVIKLKPVLKMSDEILKECPEVSKEKL
IVDHLSGKIEVKDLTFRYGRDLPLILDGVSFTVHPGEYVAIVGKSGCGKSTLVRIFMGFE
KAVSGSVSYDDIDVERIDPRSLRRSIGVVMQSGNLFYDSIYRNIAISAPGLSMEEAWEAA
EKAGIAEDIRNMPMKMKTLIPQGGGGISGGQRQRIMIARALAAKPNILIFDEATSALDNI
TQKVVQDSLDQLNCTRIVIAHRLSTIQNCDRILVLDKGRIIEEGNYQELLKKGGFFANLI
KRQQL

How would I go about this?

BLAST RefSeq NCBI • 1.6k views

I haven't used RefSeq before. Is there a way to get all RefSeq definitions as a file? If so, you can use some basic text processing in Unix to map to your .fa.
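
A minimal sketch of that text-processing step, assuming you already have a two-column, tab-separated file of accession and protein name. The file names, the `annotate_fasta` function and the truncated sequences are made up for illustration; nothing here is an NCBI tool.

```shell
# annotate_fasta MAP_TSV FASTA: append "|name" to every header found in the map
annotate_fasta() {
    awk -F '\t' '
        NR == FNR { name[$1] = $2; next }   # first file: accession -> name
        /^>/ {
            acc = substr($0, 2)
            sub(/[ \t].*/, "", acc)         # keep only the accession token
            if (acc in name) print ">" acc "|" name[acc]
            else print                      # unknown accession: header unchanged
            next
        }
        { print }                           # sequence lines pass through
    ' "$1" "$2"
}

# Demo with a hypothetical mapping file and a tiny FASTA (sequences truncated)
workdir=$(mktemp -d)
printf 'WP_051684486.1\tATP-binding cassette domain-containing protein\n' > "$workdir/names.tsv"
printf '>WP_051684486.1\nMSIFGEQF\n>WP_013276004.1\nMEVLKVSA\n' > "$workdir/in.fa"
annotate_fasta "$workdir/names.tsv" "$workdir/in.fa"
```

Headers with no entry in the mapping file are left untouched, so you can spot which accessions still need a name. The sketch assumes the mapping file is not empty.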


I know that every RefSeq accession has an Identical Protein Groups page on NCBI (in the case above it's https://www.ncbi.nlm.nih.gov/ipg/WP_051684486.1) where I can see the protein's annotation and download a csv/fasta file with the annotated sequence, but I honestly don't know if there is a way to get all the RefSeq definitions as a file.


Are you dealing only with C. aminophilum? Also: https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#allcomplete


There are protein sequences from all types of prokaryotic organisms. I tried querying IPG myself using the search string my colleagues used, and I got the sequences fully annotated as I expected, so I guess the missing names were an error or the result of some preprocessing on their part!


You can use Entrez E-utilities to do that. Without an API key, NCBI limits you to 3 requests per second (10 per second with a key).

2.8 years ago
GenoMax 141k

@drew.b.ferrell has already demonstrated one way of doing this. If all of your sequences are from "RefSeq Protein" then you could download "refseq_protein" blast database indexes from NCBI via this page. Use the method above from drew.b.ferrell to get accession numbers.

Then you can use the blastdbcmd utility from the blast+ package to retrieve the sequences you need (output truncated here).

$ blastdbcmd -db refseq_protein -entry WP_051684486.1 -outfmt %f
>WP_051684486.1 ATP-binding cassette domain-containing protein [[Clostridium] aminophilum]
MSIFGEQFLARRNRDQIDLDNALQDVYEAVTGRESIRYSINSDEQVRKELERICFYLGVRADQDVPEYNDLEDMLDYITR
PFAIMRRHILLTHHWWKNGDGPLLVSKKDSDELLALIPGRLGGYYYTDFRSNKKIKLDRHNAGEFEKEAICFYKPLPLSS
LSANELTGLLFKNMAAADLAMLVLSGIGIVGVSLLIPFATKMVFEYVIPTGAMTLVGSFSFLLISSAMVAYIIAVIKQGY

You can put all of your accession numbers in a file, one per line, and then feed that into blastdbcmd (-entry_batch option). Doing this locally will be much faster than EntrezDirect but will require you to download the refseq_protein blast database index.
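
As a rough sketch of the batch route: pull the accessions out of the FASTA, one per line, and hand that file to `blastdbcmd -entry_batch`. The function names and file paths below are hypothetical, the sample sequences are truncated, and the `blastdbcmd` call assumes blast+ is installed and a local `refseq_protein` database can be found, so it is only defined here, not run.

```shell
# extract_ids FASTA: print the first token of every header, one per line
extract_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# fetch_from_blastdb IDS_FILE: hypothetical wrapper around the command shown above.
# Assumes blast+ and a local refseq_protein database are available.
fetch_from_blastdb() {
    blastdbcmd -db refseq_protein -entry_batch "$1" -outfmt %f
}

# Demo on a tiny FASTA (sequences truncated for brevity)
workdir=$(mktemp -d)
printf '>WP_051684486.1\nMSIFGEQF\n>WP_013276004.1\nMEVLKVSA\n' > "$workdir/tmp.fa"
extract_ids "$workdir/tmp.fa" > "$workdir/ids.txt"
cat "$workdir/ids.txt"

# With the database in place you would then run something like:
#   fetch_from_blastdb "$workdir/ids.txt" > annotated.fa
```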

2.8 years ago

Right, so there may be some manual work. If you have access to a Unix machine, how many entries do you get with

grep '^>' my_fasta.fa | sed 's#\.[[:digit:]].*##' | sort -u | wc -l

Edit: I was able to do this with the E-utilities command-line tools (EDirect) from NCBI (https://www.ncbi.nlm.nih.gov/books/NBK179288/). You need all your protein IDs as a single comma-separated list, which is easy to build.

Just get the installer, run the shell script to install, and move into the install directory.

./install-edirect.sh
cd edirect

Then we need your protein sequence IDs.

Here's my fasta:

cat tmp.fa
>WP_051684486.1
MSIFGEQFLARRNRDQIDLDNALQDVYEAVTGRESIRYSINSDEQVRKELERICFYLGVRADQDVPEYND
LEDMLDYITRPFAIMRRHILLTHHWWKNGDGPLLVSKKDSDELLALIPGRLGGYYYTDFRSNKKIKLDRH
NAGEFEKEAICFYKPLPLSSLSANELTGLLFKNMAAADLAMLVLSGIGIVGVSLLIPFATKMVFEYVIPT
GAMTLVGSFSFLLISSAMVAYIIAVIKQGYADRVKVRMEVYLTHGVMGRMINFPTSFFASKSTGELYRVF
DNLREIPQILIDSVIVPIIDISLAMLFIIQIAVIVPELLVPAVITVLLQFVCMAIGTFQAYGLLNIELQQ
DRKIQGLAISVYEGIQRIKLSGSESRIMAKWAGLYSKKAKVAYPAVFPVRFQTEMIAFISMMGMLAAFYK
GFTDNISISQFVAFVAAFGMLTGSITAFSNKSKDVIKLKPVLKMSDEILKECPEVSKEKLIVDHLSGKIE
VKDLTFRYGRDLPLILDGVSFTVHPGEYVAIVGKSGCGKSTLVRIFMGFEKAVSGSVSYDDIDVERIDPR
SLRRSIGVVMQSGNLFYDSIYRNIAISAPGLSMEEAWEAAEKAGIAEDIRNMPMKMKTLIPQGGGGISGG
QRQRIMIARALAAKPNILIFDEATSALDNITQKVVQDSLDQLNCTRIVIAHRLSTIQNCDRILVLDKGRI
IEEGNYQELLKKGGFFANLIKRQQL
>WP_013276004.1
MEVLKVSAKSNPNAVAGALAGVIREKGGAEIQIIGAGALNQAVKAIAIARGYVAPSGIDLICIPAFTDIE
IDGQQRTAIKLIVEPR

We can get the protein seq ids comma-separated.

grep '>' tmp.fa | sed 's#>##' | tr  '\n' ',' | sed 's#,$##' > protein_ids.csv
cat protein_ids.csv
WP_051684486.1,WP_013276004.1

Here you have a few options, but I guess one option is to just copy that string into the next command:

esearch -db protein -query WP_051684486.1,WP_013276004.1 | efetch -format fasta
>WP_051684486.1 ATP-binding cassette domain-containing protein [[Clostridium] aminophilum]
MSIFGEQFLARRNRDQIDLDNALQDVYEAVTGRESIRYSINSDEQVRKELERICFYLGVRADQDVPEYND
LEDMLDYITRPFAIMRRHILLTHHWWKNGDGPLLVSKKDSDELLALIPGRLGGYYYTDFRSNKKIKLDRH
NAGEFEKEAICFYKPLPLSSLSANELTGLLFKNMAAADLAMLVLSGIGIVGVSLLIPFATKMVFEYVIPT
GAMTLVGSFSFLLISSAMVAYIIAVIKQGYADRVKVRMEVYLTHGVMGRMINFPTSFFASKSTGELYRVF
DNLREIPQILIDSVIVPIIDISLAMLFIIQIAVIVPELLVPAVITVLLQFVCMAIGTFQAYGLLNIELQQ
DRKIQGLAISVYEGIQRIKLSGSESRIMAKWAGLYSKKAKVAYPAVFPVRFQTEMIAFISMMGMLAAFYK
GFTDNISISQFVAFVAAFGMLTGSITAFSNKSKDVIKLKPVLKMSDEILKECPEVSKEKLIVDHLSGKIE
VKDLTFRYGRDLPLILDGVSFTVHPGEYVAIVGKSGCGKSTLVRIFMGFEKAVSGSVSYDDIDVERIDPR
SLRRSIGVVMQSGNLFYDSIYRNIAISAPGLSMEEAWEAAEKAGIAEDIRNMPMKMKTLIPQGGGGISGG
QRQRIMIARALAAKPNILIFDEATSALDNITQKVVQDSLDQLNCTRIVIAHRLSTIQNCDRILVLDKGRI
IEEGNYQELLKKGGFFANLIKRQQL
>WP_013276004.1 MULTISPECIES: stage V sporulation protein S [Thermosediminibacter]
MEVLKVSAKSNPNAVAGALAGVIREKGGAEIQIIGAGALNQAVKAIAIARGYVAPSGIDLICIPAFTDIE
IDGQQRTAIKLIVEPR
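
To get from that output to the `>accession|name` headers asked for in the question, one option is to rewrite the header lines afterwards. This is only a sketch: `rename_headers` is a made-up helper, and it assumes the header layout shown above (accession, a space, the protein name, then the organism in square brackets at the end). The demo sequences are truncated.

```shell
# rename_headers: read FASTA on stdin, turn ">ACC name [organism]" into ">ACC|name"
rename_headers() {
    awk '
        /^>/ {
            hdr = substr($0, 2)
            sp = index(hdr, " ")
            if (sp == 0) { print; next }        # header has no description
            acc = substr(hdr, 1, sp - 1)
            title = substr(hdr, sp + 1)
            n = length(title)
            if (substr(title, n, 1) == "]") {
                # walk back to the "[" that matches the final "]",
                # so nested brackets such as [[Clostridium] ...] are handled
                depth = 0
                for (i = n; i >= 1; i--) {
                    c = substr(title, i, 1)
                    if (c == "]") depth++
                    else if (c == "[") { depth--; if (depth == 0) break }
                }
                if (i >= 1) title = substr(title, 1, i - 1)
                sub(/ +$/, "", title)
            }
            print ">" acc "|" title
            next
        }
        { print }                               # sequence lines pass through
    '
}

# Demo using the two headers from the output above
printf '%s\n' \
    '>WP_051684486.1 ATP-binding cassette domain-containing protein [[Clostridium] aminophilum]' \
    'MSIFGEQF' \
    '>WP_013276004.1 MULTISPECIES: stage V sporulation protein S [Thermosediminibacter]' \
    'MEVLKVSA' | rename_headers
```

In practice you would pipe the annotated FASTA straight into `rename_headers` and redirect the result to a new file. Prefixes such as "MULTISPECIES:" are kept as part of the name.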

drew.b.ferrell: The first part of your answer lets the OP get all the accession numbers.

Since the accession numbers are then known, there is no need to do a search. You can simply retrieve the sequence directly with EntrezDirect, as you suggested:

$ efetch -db protein -id WP_051684486.1 -format fasta
>WP_051684486.1 ATP-binding cassette domain-containing protein [[Clostridium] aminophilum]
MSIFGEQFLARRNRDQIDLDNALQDVYEAVTGRESIRYSINSDEQVRKELERICFYLGVRADQDVPEYND
LEDMLDYITRPFAIMRRHILLTHHWWKNGDGPLLVSKKDSDELLALIPGRLGGYYYTDFRSNKKIKLDRH
NAGEFEKEAICFYKPLPLSSLSANELTGLLFKNMAAADLAMLVLSGIGIVGVSLLIPFATKMVFEYVIPT
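
For thousands of accessions, one way to do this is to send them to `efetch` in comma-separated batches rather than one at a time. A sketch, with assumptions: `make_batches` and `fetch_batches` are hypothetical helpers, the batch size of 2 is only for the demo, `WP_000000000.1` is a placeholder and not a real record, and `fetch_batches` needs EDirect plus network access, so it is defined but not called here.

```shell
# make_batches IDS_FILE SIZE: print one comma-separated batch of IDs per line
make_batches() {
    awk -v size="$2" '
        NF == 0 { next }                                  # skip blank lines
        { printf "%s%s", (n % size ? "," : ""), $1; n++ }
        n % size == 0 { printf "\n" }                     # batch is full
        END { if (n % size) printf "\n" }                 # close a partial batch
    ' "$1"
}

# fetch_batches IDS_FILE SIZE: hypothetical wrapper, requires EDirect and network
fetch_batches() {
    make_batches "$1" "$2" | while IFS= read -r batch; do
        # redirect stdin so efetch does not consume the remaining batches
        efetch -db protein -id "$batch" -format fasta < /dev/null
        sleep 1                                           # pause between requests
    done
}

# Demo of the batching step only
workdir=$(mktemp -d)
printf 'WP_051684486.1\nWP_013276004.1\nWP_000000000.1\n' > "$workdir/ids.txt"
make_batches "$workdir/ids.txt" 2
```

A real run would look like `fetch_batches ids.txt 100 > annotated.fa`; the batch size of 100 is an arbitrary choice, not an NCBI recommendation.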
2.8 years ago
Jiyao Wang ▴ 370

You can search the NCBI Protein database: https://www.ncbi.nlm.nih.gov/protein/WP_051684486.1


Thanks for the reply! I have a lot of sequences (10,000+) and it would be almost impossible to annotate them by hand. Is there an easy way to automate the task using NCBI tools, or would I have to write a script myself?
