Question: Can't download some proteins using Entrez
7 weeks ago by
Federal University of ABC - Brazil
gabrielpoccia0 wrote:

I'm trying to download some protein sequences using Entrez from the command line. However, although the 'esearch' command finds my proteins (searching by ID), 'efetch' doesn't return anything. I've tried both the gp and fasta formats.

Here are some example IDs:

GCB61038, GCB69151, GCC23899, GCC32047, MXQ86236

I have a list of 457 proteins whose sequences I can't download.

blast entrez genbank ncbi • 135 views
modified 7 weeks ago by Pierre Lindenbaum • written 7 weeks ago by gabrielpoccia
7 weeks ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum134k wrote:

This works for me.

$ wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=GCB61038,GCB69151,GCC23899,GCC32047,MXQ86236&rettype=fasta"
>GCB61038.1 hypothetical protein scyTo_0012843 [Scyliorhinus torazame]
MSVRGRGAGLTAALLLLALLGDTVGGRGAGPQSQAQGQGRQFDVLNQLLTDYDILSLSDIHQHTVRKRDA
--
>GCB69151.1 hypothetical protein scyTo_0001005, partial [Scyliorhinus torazame]
KRKVSCDCREKAREGTGHFGNPLSKYIRHYEGLSYDTDMLHQKHQRAKRSILHDGQFVHLDFHAHGRHFN
--
>GCC23899.1 hypothetical protein chiPu_0002297 [Chiloscyllium punctatum]
MRLGLSFSQPLTSAAGFLYQDGGVGVGLQISVSGLTGVPAEAVAVGFPLPAGGGRRKSESGVNLDLDPGR
--
>GCC32047.1 hypothetical protein chiPu_0010507 [Chiloscyllium punctatum]
MRAPTMLLLGVGLLLIWASSLRGQLGNPLNKYIRHYEGLSYDTDVLHQKHQRAKRSILHDDQFVHLDFHA
--
>MXQ86236.1 hypothetical protein E5288_WYG006535 [Bos mutus]
MRAPLGRLGEGEGREEGVRLRLPVGRRLVTWTLHDESIFSQYGNPLNKYIRHYEGLSYDVDSLHQKHQRA
written 7 weeks ago by Pierre Lindenbaum

When using the web API, it looks like the request is automatically searched against ALL protein databases.

written 7 weeks ago by GenoMax

Can you download the GP (GenPept) file this way?

written 7 weeks ago by gabrielpoccia

Yes. Change the command to this:

wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=GCB61038,GCB69151,GCC23899,GCC32047,MXQ86236&rettype=gp"
written 7 weeks ago by GenoMax
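
For a longer list (the question mentions 457 proteins), the same efetch URL can be built in batches rather than typed by hand. The following is only a sketch under stated assumptions: the file names `ids.txt` and `proteins.gp`, the batch size of 100, and the one-second pause are all illustrative choices, not something given in this thread. The pause is there because NCBI asks clients without an API key to stay under three requests per second.

```shell
#!/usr/bin/env bash
# Sketch: fetch many protein accessions through the efetch web endpoint in batches.
# Assumptions (not from the thread): input file name, batch size, output file, pause.

EFETCH="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Build an efetch URL from a comma-separated ID list and a rettype (fasta or gp).
efetch_url() {
  ids="$1"
  rettype="$2"
  printf '%s?db=protein&id=%s&rettype=%s\n' "$EFETCH" "$ids" "$rettype"
}

# Read one accession per line on stdin; print comma-joined batches, one per line.
batch_ids() {
  size="${1:-100}"
  xargs -n "$size" | tr ' ' ','
}

# Download every batch and append the records to a single output file.
fetch_all() {
  idfile="$1"
  rettype="$2"
  out="$3"
  : > "$out"                      # start with an empty output file
  batch_ids 100 < "$idfile" | while read -r ids; do
    wget -q -O - "$(efetch_url "$ids" "$rettype")" >> "$out"
    sleep 1                       # stay well below the request-rate limit
  done
}

# Example with hypothetical file names:
# fetch_all ids.txt gp proteins.gp
```

Passing `fasta` instead of `gp` as the second argument would request the FASTA records, as in the first command above.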

Oh, you saved my life! If you allow me, one last question. Some of my proteins are encoded by a CDS on a whole-genome record. In the GP file it appears as "/coded_by="join(WAAD01021866.1:<231469..231602, etc ". When I download this record (as FASTA) using efetch, I get the whole genome split into the proteins it codes for. Will this work in the way you taught me?

written 6 weeks ago by gabrielpoccia
7 weeks ago by
GenoMax96k
United States
GenoMax96k wrote:

It looks like these are from the IPG (Identical Protein Groups) database. Results are truncated for space.

$ esearch -db ipg -query "GCB69151"  | efetch -format fasta
>GCB69151.1 hypothetical protein [Scyliorhinus torazame]
KRKVSCDCREKAREGTGHFGNPLSKYIRHYEGLSYDTDMLHQKHQRAKRSILHDGQFVHLDFHAHGRHFNLRMKRDTSIF
TDDFKMEVSGEELSYDTSHIYTGEIYGERGSLSHGSIVDGRFEGFVQTHQGTFYVEPVERYIENRKPPFHSVIYHEDDID

$ esearch -db ipg -query "MXQ86236"  | efetch -format fasta
>MXQ86236.1 hypothetical protein [Bos mutus]
MRAPLGRLGEGEGREEGVRLRLPVGRRLVTWTLHDESIFSQYGNPLNKYIRHYEGLSYDVDSLHQKHQRAKRAVSHEDQF
LRLDFHAHGRHFNLRMKRDTSLFSEEFRVETSNAVLDYDTSHIYTGHIYGEEGSFSHGSVIDGRFEGFIQTHGGTFYVEP
written 7 weeks ago by GenoMax
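
The same `esearch | efetch` pipeline could be repeated over a whole list of accessions with a small loop. This is a minimal sketch, assuming EDirect is installed and that a file such as `ids.txt` (a hypothetical name) holds one accession per line. The function name `fetch_ipg_fasta` is also made up for illustration.

```shell
# Sketch: run the esearch | efetch pipeline once per accession read from stdin.
# Assumes the EDirect tools esearch and efetch are on PATH.
fetch_ipg_fasta() {
  while read -r acc; do
    [ -n "$acc" ] || continue     # skip blank lines
    # Redirect esearch's stdin so it cannot consume the remaining accessions.
    esearch -db ipg -query "$acc" < /dev/null | efetch -format fasta
  done
}

# Example with hypothetical file names:
# fetch_ipg_fasta < ids.txt > ipg_proteins.fasta
```

The `< /dev/null` redirect is a precaution: commands inside a `while read` loop share the loop's standard input, so a tool that reads stdin can silently swallow the rest of the list.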

I noticed later that I could download the FASTA using the IPG database, but what I really need is the GP file, because later I'll need the taxon information and the CDS IDs. But I think these sequences do have some problems: I took some CDS IDs by hand and tried to download the nucleotide FASTA, and it did not work, even though the information is in the database.

written 7 weeks ago by gabrielpoccia
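
Once a GP file has been downloaded, the taxon and CDS information mentioned above could be pulled out with standard text tools. The sketch below is an assumption-laden illustration, not a full GenPept parser: the function names are invented, it only looks at the first line of each `/coded_by` qualifier (a long `join(...)` wraps over several lines), and it assumes each `/organism` value fits on one line.

```shell
# Sketch: extract fields from a GenPept flat file read on stdin.

# Print the value of every /organism="..." qualifier.
gp_organisms() {
  sed -n 's/.*\/organism="\([^"]*\)".*/\1/p'
}

# Print the versioned nucleotide accession(s) found on the first line
# of every /coded_by qualifier, e.g. WAAD01021866.1.
gp_coded_by() {
  grep -o '/coded_by="[^"]*' | grep -oE '[A-Z_]+[0-9]+\.[0-9]+'
}

# Example with a hypothetical file name:
# gp_organisms < proteins.gp | sort | uniq -c
# gp_coded_by  < proteins.gp | sort -u
```

For anything beyond a quick look, a dedicated GenBank/GenPept parser would be more robust than these regular expressions.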