Question

Download a list of accession numbers from NCBI

6.9 years ago

jerome • 0

Dear all

For a study I need to download from NCBI the list of accession numbers (ACC) for a taxonomy group. I need both the nucleotide and the protein accession lists. I can download such a list from the NCBI website, using the "Send to" button and selecting the right options. Since I have to do this for more than 2 taxonomy references, I would like to use esearch and efetch, like this:

esearch -db nuccore -query "txid9606[Organism:exp]" | efetch -format=acc -mode text > list-acc.txt

The problem is that the browser lets me download a 6-million-line file in less than 20 minutes, while the same download takes hours from the command line. Is something wrong with my command line? Or is there another option to automate this process?

Regards

sequence • 2.1k views

ADD COMMENT • link 6.9 years ago by jerome • 0

Dear Istvan,

Thank you for the answer. I ran the esearch command first to get the number of accessions in the list. As I wrote in my question, I will do this for more than one taxonomy ID. I used human as the example, but I need it for other groups. I understand your remark about abuse... But it is a mess to use a browser to download more than 20 lists, even though that is the way that runs quickly. Regards.

ADD REPLY • link 6.9 years ago by jerome • 0

This comment needs to go under @Istvan's answer above.

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is reserved for new answers for the original question.

ADD REPLY • link 6.9 years ago by GenoMax 142k

score 0 · Answer 1 · 2017-07-12

6.9 years ago

Istvan Albert 100k

Running just

esearch -db nuccore -query "txid9606[Organism:exp]"

will tell you how many items the result set contains. In this case 14,781,448:

<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <WebEnv>NCID_1_37294073_130.14.18.34_9001_1499884889_1228809081_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>14781448</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
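
To pick just the count out of that message in a script, the XML can be filtered before deciding whether to fetch. A minimal sketch: `count_from_xml` is a made-up helper name, and plain sed is used here so the parsing step needs nothing beyond a POSIX shell (EDirect's own xtract tool could do the same job).

```shell
#!/bin/sh
# count_from_xml is a hypothetical helper, not part of EDirect.
# It reads an ENTREZ_DIRECT message on stdin and prints the number
# found between <Count> and </Count>.
count_from_xml() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p'
}

# Intended use (needs EDirect and network access, so left commented out):
#   esearch -db nuccore -query "txid9606[Organism:exp]" | count_from_xml
```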

In theory, the command line download should be the fastest and most efficient - but in practice who knows how it is being implemented. It is possible that they throttle the command line a little more since it is a lot easier to abuse.
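
If the command line route is kept, the same esearch/efetch pipeline can at least be looped over several taxonomy IDs and over both databases, so that it runs unattended. This is only a sketch: the helper names and the output file naming are invented for illustration, and it does not by itself make each download faster.

```shell
#!/bin/sh
# build_query, output_name and fetch_all are hypothetical helpers;
# only esearch and efetch themselves come from EDirect.

# Build the Entrez query for one taxonomy ID, including its subtree.
build_query() {
    printf 'txid%s[Organism:exp]' "$1"
}

# Name of the output file for one database/taxid pair (arbitrary scheme).
output_name() {
    printf 'acc-%s-txid%s.txt' "$1" "$2"
}

# Download the accession list for every taxid passed as an argument,
# once from nuccore and once from protein.
fetch_all() {
    for taxid in "$@"; do
        for db in nuccore protein; do
            esearch -db "$db" -query "$(build_query "$taxid")" |
                efetch -format acc > "$(output_name "$db" "$taxid")"
        done
    done
}

# Intended use (needs EDirect and network access, so left commented out):
#   fetch_all 9606 10090
```

NCBI also issues API keys that raise the allowed request rate, and EDirect picks one up from the NCBI_API_KEY environment variable. Whether that closes the gap with the browser download is not something this thread establishes.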

I will also note that you are downloading all the accession numbers for the human genome - there may be simpler ways to get that. There are prepared files on the NCBI website that may already contain what you need:

ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/

ADD COMMENT • link 6.9 years ago by Istvan Albert 100k

Another solution would be to get the nr BLAST databases and extract and format the content with blastdbcmd. In that case there is a single download, and the rest is only a matter of formatting.
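
A rough sketch of that route, under several assumptions: BLAST+ is installed, the database has already been downloaded locally (nr for proteins; nt would be the nucleotide counterpart), and `accessions_for_taxid` is a made-up helper name. The `%a` and `%T` output fields of blastdbcmd print the accession and the taxid of each sequence. Note that this matches the exact taxid only, unlike the `[Organism:exp]` query, which also covers the subtree.

```shell
#!/bin/sh
# accessions_for_taxid is a hypothetical helper, not part of BLAST+.
# Input on stdin: lines of "<accession> <taxid>".
# Output: the accessions whose taxid equals the first argument.
accessions_for_taxid() {
    awk -v want="$1" '$2 == want { print $1 }'
}

# Intended use (needs BLAST+ and a local copy of the database,
# so left commented out):
#   update_blastdb.pl --decompress nr
#   blastdbcmd -db nr -entry all -outfmt "%a %T" |
#       accessions_for_taxid 9606 > list-acc.txt
```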

ADD REPLY • link 6.9 years ago by Istvan Albert 100k