Downloading Batch Entrez Protein Information
20 months ago
Andrew • 0

I am trying to download a large number of NCBI entries for a set of accession numbers returned by a blastp search. For each accession I want the taxonomy (bacteria or virus), which is available when I run the list through Batch Entrez against the Protein database and download the results in GenPept format. The problem is that Batch Entrez flags about 230 of my 2,500 accession numbers with an error such as: Id=MBG9901843.1: protein: Wrong UID MBG9901843.1. The flagged accessions are returned when I use the Identical Protein Groups option in Batch Entrez, but then I would have to switch the format to GenPept and download them one by one.

Is there a simpler way to obtain the taxonomy for each accession in a list? I could simply ignore the accessions flagged as Wrong UID, but that would cut out too much of the data for my liking. I know this is a bit long-winded, and I'll be glad to clarify any aspect.

Entrez data NCBI Batch • 1.5k views

Thank you very much.

20 months ago
GenoMax 142k

See: NCBI Accession Number to Taxonomy ID

Using EntrezDirect:

$ esummary -db ipg -id MBG9901843 | xtract -pattern DocumentSummary -element Caption,TaxId,Organism,Div
MBG9901843      293387  Bacillus altitudinis    BCT

For more than one ID, put them in a file, one per line (here the file is named id):

$ more id
MBG9901560
MBG9901570
MBG9901580
MBG9901843
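The IDs in this example carry no version suffix, whereas the accession in the question did (MBG9901843.1). If your list comes straight from blastp output, a small helper along these lines could normalize it first. This is only a sketch: the function name normalize_ids is made up for illustration, and it assumes that dropping the trailing version is what you want. This thread does not establish whether the ipg database also accepts the versioned form.

```shell
# normalize_ids (hypothetical helper): read accessions on stdin, one per line.
# - strips trailing whitespace (including a stray carriage return)
# - strips a trailing ".<digits>" version suffix, e.g. MBG9901843.1 -> MBG9901843
# - drops blank lines and duplicates, keeping first-seen order
normalize_ids() {
  sed -e 's/[[:space:]]*$//' -e 's/\.[0-9][0-9]*$//' | awk 'NF && !seen[$0]++'
}
```

It could be used as normalize_ids < raw_ids > id, where raw_ids is a placeholder name for the file holding your original list.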

$ for i in $(cat id); do esummary -db ipg -id "${i}" | xtract -pattern DocumentSummary -element Caption,TaxId,Organism,Div; done
WP_008343464    1386    Bacillus        BCT
WP_007497670    2       Bacteria        BCT
WP_017360045    1386    Bacillus        BCT
MBG9901843      293387  Bacillus altitudinis    BCT
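Since the original question was whether each hit is bacterial or viral, the last column can be turned into a coarse label. The sketch below assumes the xtract output is tab-separated with the four columns shown above, and that Div holds GenBank division codes (BCT for bacteria, VRL for viruses, PHG for phages). The function name classify_div and the label names are illustrative, not part of EntrezDirect.

```shell
# classify_div (hypothetical helper): read tab-separated lines of
# Caption, TaxId, Organism, Div on stdin and append a coarse label.
# Assumption: Div is a GenBank division code (BCT, VRL, PHG, ...).
classify_div() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
  {
    label = "other"
    if ($4 == "BCT") label = "bacteria"
    else if ($4 == "VRL" || $4 == "PHG") label = "virus"
    print $1, $2, $3, $4, label
  }'
}
```

To use it, pipe the loop's output through the function by appending | classify_div after done. Any division other than the three listed is labeled "other" so that unexpected rows stay visible instead of being dropped.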

Great, thank you so much!


Have there been any known issues with the files transmute.Linux.gz, xtract.Linux.gz, and rchive.Linux.gz when downloading EntrezDirect? On each install, at least one of these seems to end up corrupted and fails to unzip.


If you are on macOS:

Download these files manually from the NCBI EntrezDirect ftp site: http://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/

Extract files: gunzip -f *.gz

Make all 3 files executable with chmod +x.
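The extract and chmod steps can be combined with an integrity check, which may help pin down which download is corrupted. This is a rough sketch, not an official installer step: unpack_bins is a made-up name, and it assumes the .gz files have already been downloaded into one directory.

```shell
# unpack_bins DIR (hypothetical helper): for every *.gz file in DIR,
# test the archive with gzip -t, then extract it and mark it executable.
# Stops with a non-zero status at the first archive that fails the test.
unpack_bins() {
  dir=${1:-.}
  for f in "$dir"/*.gz; do
    [ -e "$f" ] || continue            # no .gz files matched the glob
    if gzip -t "$f" 2>/dev/null; then
      gunzip -f "$f" && chmod +x "${f%.gz}"
    else
      echo "corrupt archive, re-download: $f" >&2
      return 1
    fi
  done
}
```

Running unpack_bins . in the download directory either leaves the extracted files executable or names the archive that needs to be fetched again.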
