Question: Download a list of accession numbers from NCBI
2.3 years ago by
jerome0
jerome0 wrote:

Dear all

For a study, I need to download from NCBI the list of accession numbers (ACC) for a taxonomy group, for both nucleotide and protein records. I can download this list from the NCBI website using the "Send to" button and selecting the appropriate options. Since I have to do this for more than two taxonomy IDs, I would like to use esearch and efetch, like this:

esearch -db nuccore -query "txid9606[Organism:exp]" | efetch -format=acc -mode text > list-acc.txt
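
Editor's sketch: the command above could be wrapped in a loop over several taxonomy IDs and over both databases. The taxonomy IDs, the output file names, and the DRY_RUN switch are illustrative assumptions, not part of the original post; by default the script only prints the pipelines it would run.

```shell
#!/bin/sh
# Sketch: build one esearch | efetch pipeline per database and taxonomy ID.
# The taxonomy IDs and the output file naming are illustrative assumptions.

TAXIDS="9606 10090"        # hypothetical list; replace with your own IDs
DATABASES="nuccore protein"
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands, 0 = run them

# Print the pipeline for one database ($1) and one taxonomy ID ($2).
acc_pipeline() {
    printf 'esearch -db %s -query "txid%s[Organism:exp]" | efetch -format acc > acc-%s-txid%s.txt\n' \
        "$1" "$2" "$1" "$2"
}

for db in $DATABASES; do
    for taxid in $TAXIDS; do
        cmd=$(acc_pipeline "$db" "$taxid")
        if [ "$DRY_RUN" = "1" ]; then
            echo "$cmd"        # show what would be executed
        else
            sh -c "$cmd"       # needs EDirect and network access
        fi
    done
done
```

Running it with DRY_RUN=0 executes the pipelines one after another, which keeps the load on the NCBI servers sequential.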

The problem is that the browser lets me download a 6-million-line file in less than 20 minutes, whereas the same download takes hours from the command line. Is something wrong with my command line? Or is there no other option to automate this process?

Regards


Dear Istvan,

Thanks for the answer. I ran the esearch command first to get the number of accessions. As I wrote in my question, I will do this for more than one taxonomy ID. I used human as the example, but I need other groups as well. I understand your remark about abuse, but it is a mess to use a browser to download more than 20 lists. Still, that is the way that runs quickly. Regards.

written 2.3 years ago by jerome0

This comment needs to go under @Istvan's answer above.

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is reserved for new answers for the original question.

written 2.3 years ago by genomax 74k
2.3 years ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

Running just

esearch -db nuccore -query "txid9606[Organism:exp]"

will tell you how many items the result set contains; in this case, 14,781,448:

<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <WebEnv>NCID_1_37294073_130.14.18.34_9001_1499884889_1228809081_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>14781448</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
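
Editor's sketch: in a script, the count can be pulled out of this message before deciding whether to start a long download. The helper name entrez_count is hypothetical and the sample message is an abbreviated copy of the output above; EDirect's own `xtract -pattern ENTREZ_DIRECT -element Count` should give the same number.

```shell
# Print the <Count> value from an ENTREZ_DIRECT XML message read on stdin.
entrez_count() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p'
}

# Offline demonstration using an abbreviated copy of the message shown above.
sample='<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <QueryKey>1</QueryKey>
  <Count>14781448</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>'
printf '%s\n' "$sample" | entrez_count

# Against the live service (needs EDirect and network access):
# esearch -db nuccore -query "txid9606[Organism:exp]" | entrez_count
```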

In theory, the command line download should be the fastest and most efficient - but in practice who knows how it is being implemented. It is possible that they throttle the command line a little more since it is a lot easier to abuse.
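
Editor's note on throttling: NCBI documents a limit of 3 E-utilities requests per second without an API key and 10 per second with one, and EDirect picks the key up from the NCBI_API_KEY environment variable. Whether that limit explains the slowdown reported here is not established; the value below is a placeholder.

```shell
# Placeholder value: replace with the key from your NCBI account settings.
export NCBI_API_KEY="your_api_key_here"
```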

I will also note that you are downloading all the accession numbers for the human genome - there may be simpler ways to get that. There are prepared files on the NCBI website that may already contain what you need:

ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/

Another solution would be to get the nr BLAST databases and extract and format their content with blastdbcmd. In that case there is a single download, and the rest is only a matter of formatting.
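
Editor's sketch of that approach, assuming BLAST+ is installed and the databases have been downloaded locally (nt holds nucleotide sequences, nr holds proteins). blastdbcmd can print the accession (`%a`) and taxonomy ID (`%T`) of every entry, and a small filter keeps one taxonomy ID. The helper name filter_by_taxid and the sample accessions are made up. Note that this matches the exact taxonomy ID only, whereas `[Organism:exp]` in Entrez also includes descendant taxa.

```shell
# Keep accessions whose taxonomy ID (second column) equals the given ID.
filter_by_taxid() {
    awk -v id="$1" '$2 == id { print $1 }'
}

# Offline demonstration on made-up lines in "accession taxid" form.
printf '%s\n' 'ACC0001.1 9606' 'ACC0002.1 10090' 'ACC0003.2 9606' | filter_by_taxid 9606

# With BLAST+ and a local copy of the database:
# blastdbcmd -db nt -entry all -outfmt "%a %T" | filter_by_taxid 9606 > list-acc-nt.txt
```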

written 2.3 years ago by Istvan Albert ♦♦ 81k