Download a list of accession numbers from NCBI
6.9 years ago
jerome • 0

Dear all

For a study, I need to download from NCBI the list of accession numbers (ACC) for a taxonomy group, for both nucleotide and protein records. I can download this list from the NCBI website using the "Send to" button and choosing the appropriate options. Since I have to do this for more than two taxonomy IDs, I would like to use esearch and efetch, like this:

esearch -db nuccore -query "txid9606[Organism:exp]" | efetch -format=acc -mode text > list-acc.txt

The problem is that the browser lets me download a 6-million-line file in less than 20 minutes, whereas the same download takes hours from the command line. Is something wrong with my command? Or is there no other option to automate this process?

Regards

sequence • 2.1k views

Dear Istvan,

Thanks for your answer. I first ran the esearch command to get the number of accessions. As I wrote in my question, I will do this for more than one taxonomy ID. I used human as the example, but I need it for other groups. I understand your remark about abuse, but it is a mess to use a browser to download more than 20 lists. Still, that is the way that runs quickly. Regards.


This comment needs to go under @Istvan's answer above.

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is reserved for new answers for the original question.

6.9 years ago

Running just

esearch -db nuccore -query "txid9606[Organism:exp]"

tells you how many items the result set will contain. In this case, 14,781,448:

<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <WebEnv>NCID_1_37294073_130.14.18.34_9001_1499884889_1228809081_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>14781448</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
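
To check the size of a result set before committing to a long download, the count can be pulled out of that XML with xtract, which ships with EDirect. This is only a minimal sketch: `count_hits` is a hypothetical helper name, not part of EDirect.

```shell
#!/usr/bin/env bash
# count_hits is a hypothetical helper: it prints only the hit count for a
# query and does not fetch any records.
# esearch and xtract are assumed to be on PATH (both are part of EDirect).
count_hits() {
    local db="$1" query="$2"
    esearch -db "$db" -query "$query" |
        xtract -pattern ENTREZ_DIRECT -element Count
}

# Example call:
#   count_hits nuccore "txid9606[Organism:exp]"
```

Running this once per taxon and database gives a quick idea of which lists are small enough for efetch and which are better taken from a prepared file.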

In theory, the command line download should be the fastest and most efficient - but in practice who knows how it is being implemented. It is possible that they throttle the command line a little more since it is a lot easier to abuse.

I will also note that you are downloading all the accession numbers for the human genome - there may be simpler ways to get that. There are prepared files on the NCBI website that may already contain what you need:

ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/
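
For the case raised in the question, where the same download is needed for several taxonomy IDs and for both nucleotide and protein records, the esearch/efetch pipeline can be wrapped in a loop. The sketch below is one possible way to do it and makes no promise about speed. `fetch_accessions`, the output file names and the `PAUSE_SECONDS` variable are all made up for illustration; the pause is a courtesy assumption, not an NCBI requirement.

```shell
#!/usr/bin/env bash
# fetch_accessions is a hypothetical helper: for every taxonomy ID given as
# an argument it writes one accession list per database, for example
# acc-nuccore-txid9606.txt and acc-protein-txid9606.txt.
fetch_accessions() {
    local taxid db
    for taxid in "$@"; do
        for db in nuccore protein; do
            esearch -db "$db" -query "txid${taxid}[Organism:exp]" |
                efetch -format acc > "acc-${db}-txid${taxid}.txt"
            # Assumed courtesy pause between requests (default 1 second).
            sleep "${PAUSE_SECONDS:-1}"
        done
    done
}

# Example call with two taxonomy IDs (human and mouse):
#   fetch_accessions 9606 10090
```

If you have an NCBI API key, exporting it as `NCBI_API_KEY` before running the loop raises the request rate limit that E-utilities applies to you, although whether that helps with a single very large result set is not something I can confirm.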

Another solution would be to get the nr BLAST databases and extract and format the content with blastdbcmd. In that case there is a single download, and the rest is only a matter of formatting.
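
As a rough sketch of that approach: once a database has been downloaded locally, blastdbcmd can print the accessions for a taxon. Note that nr is the protein database and nt is its nucleotide counterpart, so both would be needed to cover the two lists in the question. `list_db_accessions` is a hypothetical helper name, and the sketch assumes a BLAST+ release and a version 5 database that support the `-taxids` option.

```shell
#!/usr/bin/env bash
# list_db_accessions is a hypothetical helper: it prints one accession per
# line for the given taxonomy ID from a locally installed BLAST database.
# Assumes blastdbcmd from BLAST+ with a version 5 database (for -taxids).
list_db_accessions() {
    local db="$1" taxid="$2"
    blastdbcmd -db "$db" -taxids "$taxid" -outfmt "%a"
}

# Example calls:
#   list_db_accessions nt 9606 > acc-nt-txid9606.txt
#   list_db_accessions nr 9606 > acc-nr-txid9606.txt
```

Depending on the BLAST+ version, a higher-level taxon may first need to be expanded to its species-level IDs; BLAST+ ships a get_species_taxids.sh script intended for that.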
