Can't download some proteins using Entrez
3.3 years ago

I'm trying to download some protein sequences using Entrez from the command line. However, although the 'esearch' command finds my proteins (searching by ID), 'efetch' doesn't return anything. I've tried both the gp and fasta formats.

Here are some example IDs:

GCB61038, GCB69151, GCC23899, GCC32047, MXQ86236

I have a list of 457 proteins whose sequences I can't download.

ncbi genbank entrez blast • 917 views
3.3 years ago

Works for me when calling the efetch URL directly.

$ wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=GCB61038,GCB69151,GCC23899,GCC32047,MXQ86236&rettype=fasta"
>GCB61038.1 hypothetical protein scyTo_0012843 [Scyliorhinus torazame]
MSVRGRGAGLTAALLLLALLGDTVGGRGAGPQSQAQGQGRQFDVLNQLLTDYDILSLSDIHQHTVRKRDA
--
>GCB69151.1 hypothetical protein scyTo_0001005, partial [Scyliorhinus torazame]
KRKVSCDCREKAREGTGHFGNPLSKYIRHYEGLSYDTDMLHQKHQRAKRSILHDGQFVHLDFHAHGRHFN
--
>GCC23899.1 hypothetical protein chiPu_0002297 [Chiloscyllium punctatum]
MRLGLSFSQPLTSAAGFLYQDGGVGVGLQISVSGLTGVPAEAVAVGFPLPAGGGRRKSESGVNLDLDPGR
--
>GCC32047.1 hypothetical protein chiPu_0010507 [Chiloscyllium punctatum]
MRAPTMLLLGVGLLLIWASSLRGQLGNPLNKYIRHYEGLSYDTDVLHQKHQRAKRSILHDDQFVHLDFHA
--
>MXQ86236.1 hypothetical protein E5288_WYG006535 [Bos mutus]
MRAPLGRLGEGEGREEGVRLRLPVGRRLVTWTLHDESIFSQYGNPLNKYIRHYEGLSYDVDSLHQKHQRA

When using the web API, it looks like efetch automatically searches against ALL protein databases.


Can you download the GenPept (GP) file this way?
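
A minimal sketch of how the same efetch URL might be pointed at GenPept output for a longer ID list, assuming `rettype=gp` with `retmode=text`. The helper name `efetch_urls`, the input file `ids.txt` (one accession per line), the batch size of 100, and the output file name are assumptions for illustration, not something from this thread.

```shell
#!/usr/bin/env bash
# Sketch: build one efetch URL per batch of protein accessions.
# Assumptions: accessions contain no spaces or quotes; batch size is arbitrary.

EFETCH="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Read accessions from stdin and print one URL per batch.
# $1 = rettype (for example fasta or gp), $2 = IDs per request (default 100)
efetch_urls() {
  local rettype="$1" size="${2:-100}"
  # xargs echoes the IDs in groups of $size; tr turns the spaces into commas
  xargs -n "$size" | tr ' ' ',' | while read -r ids; do
    [ -n "$ids" ] || continue   # skip the empty line produced by empty input
    printf '%s?db=protein&id=%s&rettype=%s&retmode=text\n' \
      "$EFETCH" "$ids" "$rettype"
  done
}

# Usage (needs network access; file names are hypothetical):
# efetch_urls gp 100 < ids.txt | while read -r url; do
#   wget -q -O - "$url"
#   sleep 1   # pause between requests to stay under NCBI's rate limit
# done > proteins.gp
```

Batching keeps each URL short, and the pause between requests is a conservative choice for unauthenticated access.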


Oh, you saved my life! If you don't mind, one last question. Some of my proteins are encoded by a CDS in a whole-genome record. In the GP file it appears as "/coded_by="join(WAAD01021866.1:<231469..231602, etc ". When I download this record (fasta) using efetch, I get the whole genome split into the proteins it encodes. Will this work the way you taught me?
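
One way to avoid pulling the whole genome record might be to request only the coordinates named in the coded_by qualifier, using efetch's `seq_start` and `seq_stop` parameters on the nuccore database. The sketch below is an assumption about how that could be scripted: the function name `coded_by_urls` is hypothetical, it handles plus-strand locations only, and it ignores `complement(...)`.

```shell
#!/usr/bin/env bash
# Sketch: turn a coded_by location into one nuccore efetch URL per segment.
# Assumption: locations look like ACC:start..stop, optionally wrapped in join(...).
# complement(...) locations are NOT handled here.

coded_by_urls() {
  local base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
  # Strip join( ... ) and the partial markers < and >, then split on commas
  printf '%s\n' "$1" |
    sed -e 's/^join(//' -e 's/)$//' -e 's/[<>]//g' |
    tr ',' '\n' |
    while IFS=: read -r acc range; do
      [ -n "$acc" ] || continue
      # ${range%%..*} is the start coordinate, ${range##*..} the stop
      printf '%s?db=nuccore&id=%s&seq_start=%s&seq_stop=%s&rettype=fasta\n' \
        "$base" "$acc" "${range%%..*}" "${range##*..}"
    done
}

# Usage with the first segment quoted above:
# coded_by_urls 'WAAD01021866.1:<231469..231602'
```

Each segment comes back as its own FASTA record, so a multi-exon CDS would still need its pieces concatenated in order.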

3.3 years ago
GenoMax 141k

Looks like these are from the IPG database. Results truncated for space.

$ esearch -db ipg -query "GCB69151"  | efetch -format fasta
>GCB69151.1 hypothetical protein [Scyliorhinus torazame]
KRKVSCDCREKAREGTGHFGNPLSKYIRHYEGLSYDTDMLHQKHQRAKRSILHDGQFVHLDFHAHGRHFNLRMKRDTSIF
TDDFKMEVSGEELSYDTSHIYTGEIYGERGSLSHGSIVDGRFEGFVQTHQGTFYVEPVERYIENRKPPFHSVIYHEDDID

$ esearch -db ipg -query "MXQ86236"  | efetch -format fasta
>MXQ86236.1 hypothetical protein [Bos mutus]
MRAPLGRLGEGEGREEGVRLRLPVGRRLVTWTLHDESIFSQYGNPLNKYIRHYEGLSYDVDSLHQKHQRAKRAVSHEDQF
LRLDFHAHGRHFNLRMKRDTSLFSEEFRVETSNAVLDYDTSHIYTGHIYGEEGSFSHGSVIDGRFEGFIQTHGGTFYVEP

I noticed later that I could download the fasta using the IPG database, but what I really need is the GP file, because later I'll need the taxon information and the CDS IDs. But I think these sequences do have some problems: I took some CDS IDs by hand and tried to download the nucleotide fasta, and it did not work, even though the information is in the database.
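
Once a GenPept file is in hand, the organism and coded_by fields can be pulled out with a small awk filter. This is a sketch under assumptions: the function name `gp_summary` and the file name `proteins.gp` are hypothetical, records are expected to end with `//`, and an `/organism` value that wraps onto a second line would be cut short.

```shell
#!/usr/bin/env bash
# Sketch: print accession.version, organism, and coded_by (tab-separated)
# for every record of a GenPept flat file read from stdin.
# Assumption: /organism fits on one line; coded_by may wrap over several lines.

gp_summary() {
  awk '
    /^VERSION/ { acc = $2 }
    /^ +\/organism="/ {
      org = $0
      sub(/^ +\/organism="/, "", org); sub(/"$/, "", org)
    }
    /^ +\/coded_by="/ {
      cds = $0
      sub(/^ +\/coded_by="/, "", cds)
      # If the closing quote is missing, the value continues on the next line
      if (cds ~ /"$/) { sub(/"$/, "", cds); cont = 0 } else { cont = 1 }
      next
    }
    cont {
      line = $0
      sub(/^ +/, "", line)
      cds = cds line
      if (cds ~ /"$/) { sub(/"$/, "", cds); cont = 0 }
      next
    }
    /^\/\// {
      print acc "\t" org "\t" cds
      acc = ""; org = ""; cds = ""; cont = 0
    }
  '
}

# Usage (file name is hypothetical):
# gp_summary < proteins.gp > protein_to_cds.tsv
```

The resulting table keeps the protein accession next to its source organism and CDS location, which is the information a plain FASTA download from IPG does not carry.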
