Question

Download NCBI genomes using taxonomy id (taxid)

4

Entering edit mode

7.1 years ago

David ▴ 230

Hi, Is there a quick way to download bacterial and archaea genomes from ncbi using a list of taxid ??? (got them from the GOLD database). Mainly complete genomes, scaffold and contigs.

Thanks,

ncbi python efetch biopython • 7.6k views

ADD COMMENT • link updated 2.6 years ago by Michael 54k • written 7.1 years ago by David ▴ 230

2

Entering edit mode

Get this file. Look up the taxid you need. The last column of the file has the directory which has the ftp location of the genome assembly. Get the .fna file from there.

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

Entering edit mode

That's cool~ thank you

ADD REPLY • link 7.1 years ago by shenwei356 8.4k

0

Entering edit mode

Hi i have noticed that the viral database viral database assembly file does only contain 3 viruses ???

ADD REPLY • link 7.1 years ago by David ▴ 230

0

Entering edit mode

Viral genomes are here.

RefSeq viral genomes are in this file.

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

Entering edit mode

Dear Michael, Thank you for providing a script for this task, I really appreciate! I am trying to run your script, but I am getting the following error message:

getting genome for: 7088
authoritative genome: GCA_002245475.1 GCA_002245505.1 GCA_002192655.1 GCA_900068825.1 GCA_900068815.1 GCA_900068475.1 GCA_900068455.1 GCA_900068225.1 GCA_900068335.1 GCA_900067975.1 GCA_001625245.1 GCA_001586405.1 GCA_001486225.1 GCA_900068365.1 GCA_001485985.1 GCA_900182495.1 GCA_001278395.1 GCA_001266575.1 GCA_001186105.1 GCA_002197625.1 GCA_000931545.1 GCA_001485965.1 GCA_001486065.1 GCA_900068735.1 GCA_000836215.1 GCA_001486185.1 GCA_001298355.1 GCA_000836235.1 GCA_002156985.1 GCA_000636095.1 GCA_900068715.1 GCA_002150865.1 GCA_000262585.1 GCA_000235995.2 GCA_000330985.1 GCA_002213285.1 GCA_000313835.2 GCA_001485745.1 GCA_001856805.1 GCA_000716385.1 GCA_000151625.1 Cvi_v1 Cne_v1 ASM219265v1 Heliconius_xanthocles.CAM009106 Heliconius_wallacei.JM-04-200 Heliconius_hierax.CAM008149 Heliconius_hecuba_flava.CAM008550 Heliconius_aoede.JM-09-347 Heliconius_doris_doris.CAM008684 Heliconius_heurippa.CH9 3306_assembly_v2 3314_assembly_v1 Heliconius_pardalinus_sergestus.09-202 Heliconius_elevatus_bari.MJ09-4056 Heliconius_ethilla_aerotome.09-67 Plodia_dt lac_assembly_V1 ASM126657v1 ASM118610v1 ASM219762v1 pgl_assembly_v1 Heliconius_ismenius.STRI_003 Heliconius_hecale_felix.09-273 Heliconius_pachinus.CAM008035 Ppol_1.0 Heliconius_timareta_thelxinoe.8624 Pap_ma_1.0 Pxut_1.0 Harm_1.0 CsuOGS1.0 Heliconius_numata_bicoloratus.JM-05-1116 Hzea_1.0 Msex_1.0 Dpv3 DBM_FJ_V1.1 ASM221328v1 ASM31383v2 Heliconius_cydno_cordula.M2258 P_rapae_3842_assembly_v2 MelCinx1.0 ASM15162v1

FtpPath:
basename: missing operand
Try 'basename --help' for more information.

I am running this at a computer cluster. I was wondering if you could help figure out what is going wrong. Thank you in advance!

ADD REPLY • link updated 6.7 years ago by GenoMax 141k • written 6.7 years ago by pedrorodrigues ▴ 10

1

Entering edit mode

For future reference, please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. Your response should have directly gone under @Michael's answer.

As for your question, do you have NCBI eUtils installed and in your path? What do you get when you issue this command which xtract?

ADD REPLY • link 6.7 years ago by GenoMax 141k

1

Entering edit mode

Thank you, genomax! I am new to biostars and I appreciate your comment. I was indeed not sure how to get my message posted in the right place.

Regarding my question, I am using NCBI eUtils, but I wasn't able to add the utilities to my path (.bashrc). So I just added the eUtils folder path in every line where Michael's script called for them, like in this example:

EUTILS=../e-utils/edirect
set -u
for TAX in "${TAXLIST[@]}" ; do
echo getting genome for: $TAX
   GENOME=$($EUTILS/esearch -db genome -query txid${TAX}[Organism:exp] |

I also added #!/bin/bash to the script. I think the issue is that the script is not finding the FtpPath variable, because perhaps it is missing in the records that are extracted at this step:

FTPP=`echo $RESULT | xtract -pattern DocumentSummary  -element FtpPath_GenBank`

ADD REPLY • link 6.7 years ago by pedrorodrigues ▴ 10

0

Entering edit mode

Can you run the the esearch command with an example accession number from above to see if you get anything back?

ADD REPLY • link 6.7 years ago by GenoMax 141k

0

Entering edit mode

Hi genomax! I ran esearch with the first accession number from above and I got the following result:

e-utils/edirect/esearch -db genome -query GCA_002245475.1       

> <ENTREZ_DIRECT>   <Db>genome</Db>  
> <WebEnv>NCID_1_268926074_130.14.22.215_9001_1502840143_1703591267_0MetA0_S_MegaStore_F_1</WebEnv>   <QueryKey>1</QueryKey>   <Count>1</Count>   <Step>1</Step>
> </ENTREZ_DIRECT>

ADD REPLY • link 6.7 years ago by pedrorodrigues ▴ 10

0

Entering edit mode

I think the problem is that your query

esearch -db genome -query txid7088[Organism:exp] | efetch -format docsum | xtract -pattern DocumentSummary  -element Assembly_Accession

is returning multiple genome accession #, where as @Michael's script is expecting only one.

The search works fine with one accession.

   $ esearch -db assembly -query GCA_002245475.1 | efetch -format docsum | xtract -pattern DocumentSummary  -element FtpPath_GenBank

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/245/475/GCA_002245475.1_Cvi_v1

$ FTPP=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/245/475/GCA_002245475.1_Cvi_v1

$ BASENAME=`basename $FTPP`
$ echo $BASENAME
GCA_002245475.1_Cvi_v1
$ FTPPATHG=$FTPP/$BASENAME'_genomic.fna.gz'
$ echo $FTPPATHG
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/245/475/GCA_002245475.1_Cvi_v1/GCA_002245475.1_Cvi_v1_genomic.fna.gz

Also you can do export PATH=$PATH:/path_to/e-utils/edirect to get the eUtils in your $PATH (replace path_to with real path).

ADD REPLY • link 6.7 years ago by GenoMax 141k

0

Entering edit mode

I figured out my mistake: I am using a taxID of an order of insects (Lepidoptera = 7088), and therefore the script can't find an unique FTP address for all the different genomes. I am going to make a list of taxon ids for the specific genomes that I need and use it as input for the script. Thank you, genomax, for all the help!

ADD REPLY • link 6.7 years ago by pedrorodrigues ▴ 10

score 2 · Answer 1 · 2017-03-29

So here is a shell script I have used, that you could adapt for your needs. In the current form it downloads genome sequence and protein sequences, makes a blast database out of them, then joins all blastdbs in the directory into one alias db, called "SelectedArthropods". It works - mostly - and when it breaks it breaks hard, there is no error checking really :) You need

NCBI eutils
wget in your path

usage: fetchAllGenomesByTaxon.sh "Alcanivorax borkumensis"

The genome is downloaded to the current directory. copies of the esearch results are saved as

Alcanivorax borkumensis.assembly.esearch.docsum,   
Alcanivorax borkumensis.genome.esearch.docsum

in the current directory for your reference.

#!/bin/sh

set -u #exit on onbound variable
#TAXLIST=("Daphnia pulex" "Drosophila melanogaster" "Anopheles gambiae" "Pediculus humanus"
#"Ixodes scapularis" "Apis mellifera"  "Bombyx mori")
#TAXLIST=("Strigamia maritima")
TAXLIST=$@ # provide multiple taxa on the cmd-line
for TAX in "${TAXLIST[@]}" ; do
echo getting genome for: $TAX
   GENOME=$(esearch -db genome -query "$TAX"[orgn] |
       efetch -format docsum | tee "${TAX}.genome.esearch.docsum")
   ACC=`echo $GENOME | xtract -pattern DocumentSummary  -element Assembly_Accession`
   NAME=`echo $GENOME |  xtract -pattern DocumentSummary -element Assembly_Name`
   echo authoritative genome: $ACC $NAME
   RESULT=$(esearch -db assembly -query "$ACC" |
       efetch -format docsum | tee "${TAX}.assembly.esearch.docsum")
   FTPP=`echo $RESULT | xtract -pattern DocumentSummary  -element FtpPath_GenBank`
   TAXID=`echo $RESULT | xtract -pattern DocumentSummary  -element Taxid`
   echo FtpPath: $FTPP
   BASENAME=`basename $FTPP`
   FTPPATHG=$FTPP/$BASENAME'_genomic.fna.gz'
   FTPPATHP=$FTPP/$BASENAME'_protein.faa.gz' 
   echo Downloading $FTPPATHG ...

   ## get genome data
   wget $FTPPATHG
   BASENAME=`basename $FTPPATHG`
   gunzip -f $BASENAME
   BASENAME=`echo $BASENAME | sed s/.gz//`
   makeblastdb -in $BASENAME -dbtype nucl -parse_seqids -taxid $TAXID -title "$TAX $NAME genomic"
   echo Downloading $FTPPATHP ...
   ## get protein data
   wget $FTPPATHP  # this may throw an error if there is no proteome file
   BASENAME=`basename $FTPPATHP`
   gunzip -f $BASENAME
   BASENAME=`echo $BASENAME | sed s/.gz//`
   makeblastdb -in $BASENAME -dbtype prot -parse_seqids -taxid $TAXID -title "$TAX $NAME proteins"


done

### Make a blast db of all fasta files in directory
ls  *.fna > dblist_nuc.txt
blastdb_aliastool -dblist_file dblist_nuc.txt -out SelectedArthropds -title 'selection of arth genomes' -dbtype nucl
ls *.faa > dblist_prot.txt
blastdb_aliastool -dblist_file dblist_prot.txt -out SelectedArthropds -title 'selection of arth proteins' -dbtype prot