How to retrieve genomes of isolates from specific regions? For example, how can I retrieve all the Escherichia genome FASTA files from NCBI that were submitted from the USA?
5 weeks ago
Jaykumar ▴ 40

I am beginning my work and was wondering how to do this.

USA coli Genome NCBI Escherichia • 523 views
4 weeks ago
MirianT_NCBI ▴ 310

Hi Jaykumar,

You can use the NCBI Datasets command-line tool for this task. You will also need jq to process the JSON metadata. Here are the steps:

  1. Use the datasets summary option to list the accessions and locations of the assemblies whose BioSample location is in the USA:
datasets summary genome taxon 562 --as-json-lines |\
 grep -E "\"value\":\"USA" |\
 jq -r '.assembly_accession as $accs | .biosample.attributes[] 
| select(.name == "geo_loc_name") 
| select(.value | contains("USA")) 
| [$accs,.value] 
| @tsv'

Alternatively, if you only want the accession numbers, you can do this:

# any(...) tests the geo_loc_name attribute itself, so each matching accession is written once
datasets summary genome taxon 562 | jq -r '.assemblies[].assembly
| select(any(.biosample.attributes[]?; .name == "geo_loc_name" and (.value | contains("USA"))))
| .assembly_accession' > ecoli_usa_accessions.txt
  2. Use the datasets download option to download only the genomes from the USA, based on the list we created:
datasets download genome accession \
   --inputfile ecoli_usa_accessions.txt

This will download a data package with genomic sequences, as well as protein FASTA, CDS FASTA and GFF3, if they are available, plus metadata files. If you don't need all those files, you can exclude them like this:

datasets download genome accession \
   --inputfile ecoli_usa_accessions.txt \
   --exclude-genomic-cds --exclude-protein --exclude-gff3
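
Before starting a large download, it can help to sanity-check the accession list. The sketch below assumes the file holds one assembly accession per line; the helper name clean_accessions and the output file name are hypothetical and not part of the datasets tool.

```shell
# Keep only well-formed assembly accessions (GCA_ or GCF_, nine digits,
# a dot and a version number) and drop duplicate lines.
clean_accessions() {
    grep -E '^GC[AF]_[0-9]{9}\.[0-9]+$' "$1" | sort -u
}

# Example:
#   clean_accessions ecoli_usa_accessions.txt > ecoli_usa_accessions.clean.txt
```

Comparing the line counts of the two files with wc -l shows how many entries were malformed or repeated.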

Let me know if you have any other questions or run into any issues.

I forgot to mention one more thing: since you'll be downloading a lot of data and files, I recommend using the --dehydrated flag when downloading, like this:

datasets download genome accession \
   --inputfile ecoli_usa_accessions.txt \
   --dehydrated

This option gives you the metadata files and a text file with the paths needed to retrieve the data. Data retrieval will be faster and can be resumed if it fails. Here are the next steps:

  • Unzip the dehydrated package:

    unzip ncbi_dataset.zip -d ecoli_usa
  • Rehydrate (aka retrieve/download) ALL data files:

    datasets rehydrate --directory ecoli_usa
  • As an alternative, you can retrieve only the genomic assembly files, like this:

    datasets rehydrate --directory ecoli_usa --match "GC.*/GC.*genomic.fna"
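
One way to confirm that rehydration finished is to compare the number of genomic FASTA files with the number of accessions requested. This is a minimal sketch, assuming the package was unzipped into ecoli_usa and that genomic sequence files follow the GC*genomic.fna naming used in the --match pattern; count_genomes is a hypothetical helper name.

```shell
# Count genomic FASTA files under a rehydrated package directory.
# The GC* prefix skips cds_from_genomic.fna, which also ends in _genomic.fna.
count_genomes() {
    find "$1" -type f -name 'GC*_genomic.fna' | wc -l | tr -d ' '
}

# Example (run after rehydrating):
#   count_genomes ecoli_usa
#   wc -l < ecoli_usa_accessions.txt
```

If the first number is lower than the second, running the rehydrate command again should pick up the files that are still missing.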

I hope it helps!

Thank you very much! It worked!!

5 weeks ago
GenoMax 118k

One way is to use EntrezDirect:

$  esearch -db assembly -query "562 [taxID]" | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,BioSampleAccn,SubmitterOrganization,FtpPath_GenBank | head -10
    GCA_024134465.1 SAMN29473037    CDC
    GCA_024134005.1 SAMN29474218    CDC
    GCA_024133985.1 SAMN29473020    CDC
    GCA_024133965.1 SAMN29474283    CDC
    GCA_024133945.1 SAMN29473019    CDC
    GCA_024133825.1 SAMN29474253    Health Protection Agency
    GCA_024133685.1 SAMN29473230    CDC
    GCA_024133585.1 SAMN29474221    CDC
    GCA_024133405.1 SAMN29474277    CDC
    GCA_024133385.1 SAMN29473022    CDC

If you can't pick out the entries you need from the submitter names in the third column, you can run more elaborate queries on the BioSample accessions in the second column.
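
As a rough illustration of parsing the third column, the sketch below filters the tab-separated xtract table by submitter. It assumes the column order shown above (accession, BioSample, submitter); submitter_accessions is a hypothetical helper name, and the submitting organization is only a proxy for where an isolate came from.

```shell
# Read the tab-separated xtract table on stdin and print the accession
# (column 1) of every row whose submitter (column 3) equals the given name.
submitter_accessions() {
    awk -F '\t' -v org="$1" '$3 == org { print $1 }'
}

# Example: pipe the esearch | esummary | xtract output into the helper:
#   ... | submitter_accessions "CDC" > cdc_accessions.txt
```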

