Question

Can't download some proteins using Entrez

0

4.5 years ago

gabrielpoccia • 0

I'm trying to download some protein sequences with Entrez from the command line. However, although the 'esearch' command finds my proteins (searching by ID), 'efetch' doesn't return anything. I've tried both the gp and fasta formats.

Here are some example IDs:

GCB61038, GCB69151, GCC23899, GCC32047, MXQ86236

I have a list of 457 proteins whose sequences I can't download.

ncbi genbank entrez blast • 1.4k views

ADD COMMENT • link updated 4.5 years ago by Pierre Lindenbaum 166k • written 4.5 years ago by gabrielpoccia • 0

1

4.5 years ago

GenoMax 152k

Looks like these are from the IPG database. Results truncated for space.

$ esearch -db ipg -query "GCB69151"  | efetch -format fasta
>GCB69151.1 hypothetical protein [Scyliorhinus torazame]
KRKVSCDCREKAREGTGHFGNPLSKYIRHYEGLSYDTDMLHQKHQRAKRSILHDGQFVHLDFHAHGRHFNLRMKRDTSIF
TDDFKMEVSGEELSYDTSHIYTGEIYGERGSLSHGSIVDGRFEGFVQTHQGTFYVEPVERYIENRKPPFHSVIYHEDDID

$ esearch -db ipg -query "MXQ86236"  | efetch -format fasta
>MXQ86236.1 hypothetical protein [Bos mutus]
MRAPLGRLGEGEGREEGVRLRLPVGRRLVTWTLHDESIFSQYGNPLNKYIRHYEGLSYDVDSLHQKHQRAKRAVSHEDQF
LRLDFHAHGRHFNLRMKRDTSLFSEEFRVETSNAVLDYDTSHIYTGHIYGEEGSFSHGSVIDGRFEGFIQTHGGTFYVEP

ADD COMMENT • link 4.5 years ago by GenoMax 152k

0

I noticed later that I could download the FASTA using the IPG database, but what I really need is the GP file, because later I'll need the taxon information and the CDS IDs. But I think these sequences do have some problems: I took some CDS IDs by hand and tried to download the nucleotide FASTA, and it did not work, even though the information is in the database.

ADD REPLY • link 4.5 years ago by gabrielpoccia • 0

score 3 · Accepted Answer · 2021-01-15

3

4.5 years ago

Pierre Lindenbaum 166k

works for me.

$ wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=GCB61038,GCB69151,GCC23899,GCC32047,MXQ86236&rettype=fasta"
>GCB61038.1 hypothetical protein scyTo_0012843 [Scyliorhinus torazame]
MSVRGRGAGLTAALLLLALLGDTVGGRGAGPQSQAQGQGRQFDVLNQLLTDYDILSLSDIHQHTVRKRDA
--
>GCB69151.1 hypothetical protein scyTo_0001005, partial [Scyliorhinus torazame]
KRKVSCDCREKAREGTGHFGNPLSKYIRHYEGLSYDTDMLHQKHQRAKRSILHDGQFVHLDFHAHGRHFN
--
>GCC23899.1 hypothetical protein chiPu_0002297 [Chiloscyllium punctatum]
MRLGLSFSQPLTSAAGFLYQDGGVGVGLQISVSGLTGVPAEAVAVGFPLPAGGGRRKSESGVNLDLDPGR
--
>GCC32047.1 hypothetical protein chiPu_0010507 [Chiloscyllium punctatum]
MRAPTMLLLGVGLLLIWASSLRGQLGNPLNKYIRHYEGLSYDTDVLHQKHQRAKRSILHDDQFVHLDFHA
--
>MXQ86236.1 hypothetical protein E5288_WYG006535 [Bos mutus]
MRAPLGRLGEGEGREEGVRLRLPVGRRLVTWTLHDESIFSQYGNPLNKYIRHYEGLSYDVDSLHQKHQRA

ADD COMMENT • link 4.5 years ago by Pierre Lindenbaum 166k

0

When using the web API, it looks like it automatically searches against ALL protein databases.

ADD REPLY • link 4.5 years ago by GenoMax 152k

0

Can you download the GP this way?

ADD REPLY • link 4.5 years ago by gabrielpoccia • 0

0

Yes. Change the command to this:

wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=GCB61038,GCB69151,GCC23899,GCC32047,MXQ86236&rettype=gp"

ADD REPLY • link 4.5 years ago by GenoMax 152k
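
For a longer list, such as the 457 proteins mentioned in the question, the same efetch URL can be generated in batches instead of being typed by hand. The following is only a minimal sketch: the function name build_efetch_urls, the input file ids.txt (one accession per line), and the batch size of 100 are assumptions for illustration and do not come from the thread. The function only prints URLs; the download itself uses the same wget call shown above.

```shell
#!/bin/sh
# Base URL taken from the wget commands in this thread.
EFETCH_URL="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# build_efetch_urls RETTYPE [BATCH_SIZE]
# Reads protein accessions from stdin (one per line, blank lines ignored)
# and prints one efetch URL per batch of BATCH_SIZE accessions.
# BATCH_SIZE defaults to 100, an arbitrary value chosen for this sketch.
build_efetch_urls() {
  awk -v n="${2:-100}" -v base="$EFETCH_URL" -v rt="$1" '
    # Append each non-empty accession to the current comma-separated batch.
    NF { ids = (ids == "" ? $1 : ids "," $1); k++ }
    # Batch is full: print its URL and start a new batch.
    k == n { printf "%s?db=protein&id=%s&rettype=%s\n", base, ids, rt; ids = ""; k = 0 }
    # Print whatever is left over as a final, smaller batch.
    END { if (k > 0) printf "%s?db=protein&id=%s&rettype=%s\n", base, ids, rt }
  '
}

# Hypothetical usage (ids.txt and proteins.gp are assumed file names):
#   build_efetch_urls gp 100 < ids.txt |
#     while read -r url; do wget -q -O - "$url"; sleep 1; done > proteins.gp
```

The sleep between requests is there to stay well under NCBI's request rate limit for clients without an API key. Replacing gp with fasta in the call gives the FASTA records instead.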

0

Oh, you saved my life! If you'll allow me, one last question. Some of my proteins are encoded by a CDS on a whole-genome record. In the GP file it appears as /coded_by="join(WAAD01021866.1:<231469..231602, etc. When I download this record (as FASTA) using efetch, I get the whole genome split into the proteins it codes for. Will this work the way you taught me?

ADD REPLY • link 4.5 years ago by gabrielpoccia • 0
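
This last question was left unanswered in the thread. One possible approach, sketched here under stated assumptions rather than given as a confirmed solution, is to request only the sub-range named in a /coded_by interval, using efetch's seq_start and seq_stop parameters against the nuccore database. The function name coded_by_segment_url is hypothetical. The sketch handles a single forward-strand interval of the form ACCESSION:START..STOP; a join(...) with several intervals would need one request per interval, and complement(...) intervals are not handled.

```shell
#!/bin/sh
# Base URL taken from the wget commands in this thread.
EFETCH_URL="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# coded_by_segment_url SEGMENT
# SEGMENT is one interval copied from a /coded_by qualifier,
# e.g. "WAAD01021866.1:<231469..231602".
# Prints an efetch URL for just that sub-range of the nucleotide record.
coded_by_segment_url() {
  # Everything before the first ":" is the nucleotide accession.
  acc="${1%%:*}"
  # Drop the accession, then strip the partial-feature markers "<" and ">".
  range="$(printf '%s' "${1#*:}" | tr -d '<>')"
  # Split START..STOP on the ".." separator.
  start="${range%%..*}"
  stop="${range##*..}"
  printf '%s?db=nuccore&id=%s&seq_start=%s&seq_stop=%s&rettype=fasta\n' \
    "$EFETCH_URL" "$acc" "$start" "$stop"
}

# Hypothetical usage:
#   wget -q -O - "$(coded_by_segment_url 'WAAD01021866.1:<231469..231602')"
```

Each interval of a join would be fetched separately and concatenated in order to rebuild the coding sequence; whether that matches what the asker needs depends on the downstream analysis.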