How to get the name of a protein by knowing only the RefSeq accession number?
11 months ago
Alex ▴ 10

Hi everyone! I have a fasta file with amino acid sequences which have only the RefSeq accession number (e.g. WP_ + 9 digits) and I'm trying to get the name of the proteins so that I can add them to the fasta ids. Here is an example:

>WP_051684486.1
MSIFGEQFLARRNRDQIDLDNALQDVYEAVTGRESIRYSINSDEQVRKELERICFYLGVR
MLVLSGIGIVGVSLLIPFATKMVFEYVIPTGAMTLVGSFSFLLISSAMVAYIIAVIKQGY
ISLAMLFIIQIAVIVPELLVPAVITVLLQFVCMAIGTFQAYGLLNIELQQDRKIQGLAIS
VYEGIQRIKLSGSESRIMAKWAGLYSKKAKVAYPAVFPVRFQTEMIAFISMMGMLAAFYK
GFTDNISISQFVAFVAAFGMLTGSITAFSNKSKDVIKLKPVLKMSDEILKECPEVSKEKL
IVDHLSGKIEVKDLTFRYGRDLPLILDGVSFTVHPGEYVAIVGKSGCGKSTLVRIFMGFE
KAVSGSVSYDDIDVERIDPRSLRRSIGVVMQSGNLFYDSIYRNIAISAPGLSMEEAWEAA
EKAGIAEDIRNMPMKMKTLIPQGGGGISGGQRQRIMIARALAAKPNILIFDEATSALDNI
TQKVVQDSLDQLNCTRIVIAHRLSTIQNCDRILVLDKGRIIEEGNYQELLKKGGFFANLI
KRQQL


On the NCBI RefSeq site, this maps to "ATP-binding cassette domain-containing protein", so I want to add that to the identifier in order to get:

>WP_051684486.1|ATP-binding cassette domain-containing protein
I haven't used RefSeq before. Is there a way to get all RefSeq definitions as a file? If so, you can use some basic text processing in Unix to map to your .fa.
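To make the "basic text processing" step concrete, here is a minimal sketch. It assumes you already have a tab-separated table with the accession in column 1 and the protein name in column 2. The function name `annotate_fasta` and the file names `names.tsv` and `my_fasta.fa` are hypothetical, not something NCBI provides.

```shell
# Sketch only: rewrite FASTA headers as ">ACCESSION|protein name" using a
# two-column TSV lookup table (accession<TAB>name). Assumes the table is
# non-empty and that each FASTA header starts with the accession.
annotate_fasta() {
  # $1 = lookup table (hypothetical names.tsv), $2 = FASTA file
  awk -F'\t' '
    NR == FNR { name[$1] = $2; next }      # first file: load the lookup table
    /^>/ {
      acc = substr($0, 2)                  # drop the leading ">"
      sub(/[ \t].*/, "", acc)              # keep only the accession token
      if (acc in name) print ">" acc "|" name[acc]
      else             print $0            # no match: leave header untouched
      next
    }
    { print }                              # sequence lines pass through
  ' "$1" "$2"
}

# Usage (hypothetical file names):
#   annotate_fasta names.tsv my_fasta.fa > my_fasta.annotated.fa
```

Headers with no entry in the table are left as they are, so you can grep for them afterwards and look those accessions up separately.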

I know that every RefSeq accession has an Identical Protein Groups page on NCBI (in the case above it's https://www.ncbi.nlm.nih.gov/ipg/WP_051684486.1) where I can see the protein's annotation and download a csv/fasta file with the annotated sequence, but I honestly don't know if there is a way to get all the RefSeq definitions as a file.

Are you dealing only with C.aminophilum? Also: https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#allcomplete

There are protein sequences from all types of prokaryotic organisms. I have tried querying IPG myself using the search string my other colleagues used, and I was given the sequences fully annotated as I expected, so I guess it was an error or some kind of preprocessing on their part!

You can use Entrez E-utilities to do that. I think you are limited to 3 requests per second if you are not using an API key.
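With 10000+ accessions it probably makes sense to send them in batches rather than one request per ID. Below is a rough sketch, assuming EDirect's `efetch` is installed and on your PATH. The helper names `batch_ids` and `fetch_batches`, the file `ids.txt`, the batch size of 100 and the one-second pause are all my own assumptions, so tune them as needed.

```shell
# Sketch only: group accessions (one per line on stdin) into comma-separated
# batches of at most $1 IDs, one batch per output line.
batch_ids() {
  awk -v n="$1" '
    NF {
      buf = (buf == "" ? $1 : buf "," $1)
      if (++c == n) { print buf; buf = ""; c = 0 }
    }
    END { if (buf != "") print buf }       # flush the last, partial batch
  '
}

# Fetch each batch with efetch, pausing between requests to stay well under
# the request limit. efetch gets /dev/null as stdin so it cannot swallow
# the remaining batches that the loop is reading.
fetch_batches() {
  while IFS= read -r batch; do
    efetch -db protein -id "$batch" -format fasta < /dev/null
    sleep 1
  done
}

# Usage (hypothetical file name, arbitrary batch size):
#   batch_ids 100 < ids.txt | fetch_batches > annotated.fa
```

The batching function is pure text processing, so you can check its output on a few IDs before sending anything to NCBI.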

11 months ago
GenoMax 117k

@drew.b.ferrell has already demonstrated one way of doing this. If all of your sequences are from "RefSeq Protein" then you could download "refseq_protein" blast database indexes from NCBI via this page. Use the method above from drew.b.ferrell to get accession numbers.

Then you can use blastdbcmd utility from blast+ package to retrieve sequences you need (sequence truncated).

$ blastdbcmd -db refseq_protein -entry WP_051684486.1 -outfmt %f
>WP_051684486.1 ATP-binding cassette domain-containing protein [[Clostridium] aminophilum]
MSIFGEQFLARRNRDQIDLDNALQDVYEAVTGRESIRYSINSDEQVRKELERICFYLGVRADQDVPEYNDLEDMLDYITR
PFAIMRRHILLTHHWWKNGDGPLLVSKKDSDELLALIPGRLGGYYYTDFRSNKKIKLDRHNAGEFEKEAICFYKPLPLSS
LSANELTGLLFKNMAAADLAMLVLSGIGIVGVSLLIPFATKMVFEYVIPTGAMTLVGSFSFLLISSAMVAYIIAVIKQGY

You can put all of your accession numbers in a file and then feed that in to blastdbcmd (-entry_batch option). Doing this locally will be much faster than EntrezDirect but will require you to download the refseq_protein blast database index.

11 months ago

Right, so there may be some manual work. If you have access to a Unix machine, how many entries do you get with

grep '>' my_fasta.fa | sed 's#\.[[:digit:]].*##' | uniq

edit: I was able to use the Eutils command-line tools from NCBI (https://www.ncbi.nlm.nih.gov/books/NBK179288/). But you need all your protein IDs in a sequence separated by a comma, which we can do. Just get the installer, run the shell file to install, and move into that directory.

./install-edirect.sh
cd edirect

Then we need your protein sequence IDs. Here's my fasta:

cat tmp.fa
>WP_051684486.1
MSIFGEQFLARRNRDQIDLDNALQDVYEAVTGRESIRYSINSDEQVRKELERICFYLGVRADQDVPEYND
LEDMLDYITRPFAIMRRHILLTHHWWKNGDGPLLVSKKDSDELLALIPGRLGGYYYTDFRSNKKIKLDRH
NAGEFEKEAICFYKPLPLSSLSANELTGLLFKNMAAADLAMLVLSGIGIVGVSLLIPFATKMVFEYVIPT
GAMTLVGSFSFLLISSAMVAYIIAVIKQGYADRVKVRMEVYLTHGVMGRMINFPTSFFASKSTGELYRVF
DNLREIPQILIDSVIVPIIDISLAMLFIIQIAVIVPELLVPAVITVLLQFVCMAIGTFQAYGLLNIELQQ
DRKIQGLAISVYEGIQRIKLSGSESRIMAKWAGLYSKKAKVAYPAVFPVRFQTEMIAFISMMGMLAAFYK
GFTDNISISQFVAFVAAFGMLTGSITAFSNKSKDVIKLKPVLKMSDEILKECPEVSKEKLIVDHLSGKIE
VKDLTFRYGRDLPLILDGVSFTVHPGEYVAIVGKSGCGKSTLVRIFMGFEKAVSGSVSYDDIDVERIDPR
SLRRSIGVVMQSGNLFYDSIYRNIAISAPGLSMEEAWEAAEKAGIAEDIRNMPMKMKTLIPQGGGGISGG
QRQRIMIARALAAKPNILIFDEATSALDNITQKVVQDSLDQLNCTRIVIAHRLSTIQNCDRILVLDKGRI
IEEGNYQELLKKGGFFANLIKRQQL
>WP_013276004.1
MEVLKVSAKSNPNAVAGALAGVIREKGGAEIQIIGAGALNQAVKAIAIARGYVAPSGIDLICIPAFTDIE
IDGQQRTAIKLIVEPR

We can get the protein seq ids comma-separated.

grep '>' tmp.fa | sed 's#>##' | tr '\n' ',' | sed 's#,$##' > protein_ids.csv

cat protein_ids.csv
WP_051684486.1,WP_013276004.1


Here you have a few options, but I guess one option is to just copy that string into the next command:

esearch -db protein -query WP_051684486.1,WP_013276004.1 | efetch -format fasta
>WP_051684486.1 ATP-binding cassette domain-containing protein [[Clostridium] aminophilum]
>WP_013276004.1 MULTISPECIES: stage V sporulation protein S [Thermosediminibacter]
MEVLKVSAKSNPNAVAGALAGVIREKGGAEIQIIGAGALNQAVKAIAIARGYVAPSGIDLICIPAFTDIE
IDGQQRTAIKLIVEPR
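The deflines that come back look like `>ACCESSION name [organism]`, while the format asked for in the question is `>ACCESSION|name`. Here is a small sketch of that last reformatting step. The function name `relabel_fasta` and the file name in the usage line are hypothetical, and the organism stripping assumes the protein name itself never contains " [" (a space followed by an opening bracket), which may not hold for every record.

```shell
# Sketch only: turn ">ACC name [organism]" deflines into ">ACC|name".
# Assumption: the first " [" in the title starts the organism tag.
relabel_fasta() {
  awk '
    /^>/ {
      acc   = $1                               # e.g. ">WP_051684486.1"
      title = substr($0, length($1) + 2)       # text after the accession
      sub(/ \[.*\]$/, "", title)               # drop trailing " [organism]"
      print (title == "" ? acc : acc "|" title)
      next
    }
    { print }                                  # sequence lines unchanged
  '
}

# Usage (hypothetical file name):
#   esearch -db protein -query WP_051684486.1,WP_013276004.1 \
#     | efetch -format fasta | relabel_fasta > relabelled.fa
```

Prefixes such as "MULTISPECIES: " are kept as part of the name; strip them with an extra `sub()` if you don't want them in the ID.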

drew.b.ferrell : First part of your answer allows OP to get all accession numbers.

Since the accession numbers are now known, there is no need to do a search. You can simply retrieve the sequence like this using EntrezDirect, as you had suggested:

$ efetch -db protein -id WP_051684486.1 -format fasta
>WP_051684486.1 ATP-binding cassette domain-containing protein [[Clostridium] aminophilum]
LEDMLDYITRPFAIMRRHILLTHHWWKNGDGPLLVSKKDSDELLALIPGRLGGYYYTDFRSNKKIKLDRH

11 months ago
Jiyao Wang ▴ 210

You can search the NCBI Protein database: https://www.ncbi.nlm.nih.gov/protein/WP_051684486.1

Thanks for the reply! I have a lot of sequences (10000+) and it would be almost impossible to annotate them by hand. Is there an easy way to automate the task using NCBI tools, or would I have to write a script myself?