How to retrieve genomes of isolates from specific regions? For example, I want to retrieve all the Escherichia genome FASTA files from NCBI that were submitted from the USA.
5 weeks ago
Jaykumar

I am beginning my work and was wondering how to do this.

4 weeks ago
MirianT_NCBI

Hi Jaykumar,

You can use the NCBI Datasets command-line tool for this task. You will also need jq to process the JSON metadata. Here are the steps:

  1. Using the datasets summary option, get a list of accessions and locations for the USA only (taxon 562 is Escherichia coli; use 561 for the whole genus Escherichia):
datasets summary genome taxon 562 --as-json-lines |\
 grep -E "\"value\":\"USA" |\
 jq -r '.assembly_accession as $accs | .biosample.attributes[] 
| select(.name == "geo_loc_name") 
| select(.value | contains("USA")) 
| [$accs,.value] 
| @tsv'
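To see which locations the command above returns, the two-column output can be tallied. This is only a sketch: `count_locations` is a hypothetical helper name, and it assumes tab-separated rows on standard input with the accession in column 1 and the geo_loc_name value in column 2.

```shell
# Hypothetical helper: count assemblies per geo_loc_name value.
# Reads tab-separated "accession<TAB>location" rows on stdin and
# prints "location<TAB>count" rows.
count_locations() {
    awk -F '\t' '{ n[$2]++ } END { for (k in n) printf "%s\t%d\n", k, n[k] }' | sort
}

# Usage: append "| count_locations" to the datasets/jq pipeline above.
```

The counts make it easy to spot unexpected values before building the accession list.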

Alternatively, if you only want the accession numbers, you can do this:

# any(...) checks name and value within the same attribute, so each accession is printed once
datasets summary genome taxon 562 | jq -r '.assemblies[].assembly 
| select(any(.biosample.attributes[]?; .name == "geo_loc_name" and (.value | contains("USA")))) 
| .assembly_accession' > ecoli_usa_accessions.txt
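Before downloading, it may be worth checking the accession list. The following is a minimal sketch, not part of the Datasets tool: `clean_accessions` is a hypothetical helper that keeps only lines shaped like assembly accessions (GCA_ or GCF_, nine digits, a version number) and removes duplicates. The output file name `ecoli_usa_accessions.clean.txt` is also an assumption.

```shell
# Hypothetical helper: keep well-formed, unique assembly accessions.
# $1: file with one accession per line
clean_accessions() {
    grep -E '^GC[AF]_[0-9]{9}\.[0-9]+$' "$1" | sort -u
}

# Usage (runs only if the list from the previous step exists):
if [ -f ecoli_usa_accessions.txt ]; then
    clean_accessions ecoli_usa_accessions.txt > ecoli_usa_accessions.clean.txt
    wc -l < ecoli_usa_accessions.clean.txt   # number of unique accessions
fi
```

If you use it, pass the cleaned file to --inputfile instead of the original list.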
  2. Using the datasets download option, you can download only the genomes from the USA, based on the list we created:
datasets download genome accession \
   --inputfile ecoli_usa_accessions.txt

This will download a data package with genomic sequences, as well as protein FASTA, CDS FASTA and GFF3, if they are available, plus metadata files. If you don't need all those files, you can exclude them like this:

datasets download genome accession \
   --inputfile ecoli_usa_accessions.txt \
   --exclude-genomic-cds --exclude-protein --exclude-gff3

Let me know if you have any other questions or run into any issues.

I forgot to mention one more thing: since you'll be downloading a lot of data and files, I recommend using the --dehydrated flag when downloading, like this:

datasets download genome accession \
   --inputfile ecoli_usa_accessions.txt \
   --dehydrated

This option gives you the metadata files and a text file listing the paths needed to retrieve the data. Data retrieval will be faster and can be resumed if it fails. Here are the next steps:

  • Unzip the dehydrated package (ncbi_dataset.zip is the default file name when no --filename is given):

    unzip ncbi_dataset.zip -d ecoli_usa
  • Rehydrate (aka retrieve/download) ALL data files:

    datasets rehydrate --directory ecoli_usa
  • As an alternative, you can retrieve only the genomic assembly files, like this:

    datasets rehydrate --directory ecoli_usa --match "GC.*/GC.*genomic.fna"
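After rehydrating, a quick check can show whether any genomes are still missing. This is a sketch under two assumptions: that the package keeps each assembly under `ncbi_dataset/data/<accession>/`, and that the genomic FASTA file name starts with the accession and ends in `_genomic.fna`. The name `missing_genomes` is hypothetical.

```shell
# Hypothetical helper: print accessions that have no genomic FASTA yet.
# $1: accession list (one per line)
# $2: directory the package was unzipped into
missing_genomes() {
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue   # skip blank lines
        # Assumed layout: <dir>/ncbi_dataset/data/<accession>/<accession>_*_genomic.fna
        if ! ls "$2/ncbi_dataset/data/$acc/$acc"_*_genomic.fna >/dev/null 2>&1; then
            printf '%s\n' "$acc"
        fi
    done < "$1"
}

# Usage (runs only if both inputs exist):
if [ -f ecoli_usa_accessions.txt ] && [ -d ecoli_usa ]; then
    missing_genomes ecoli_usa_accessions.txt ecoli_usa
fi
```

An empty result suggests every accession in the list has its genomic FASTA; otherwise, rerunning the rehydrate command should pick up where it left off.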

I hope it helps!

Thank you very much! It worked!!

5 weeks ago
GenoMax

One way is to use EntrezDirect:

$  esearch -db assembly -query "562 [taxID]" | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,BioSampleAccn,SubmitterOrganization,FtpPath_GenBank | head -10
    GCA_024134465.1 SAMN29473037    CDC
    GCA_024134005.1 SAMN29474218    CDC
    GCA_024133985.1 SAMN29473020    CDC
    GCA_024133965.1 SAMN29474283    CDC
    GCA_024133945.1 SAMN29473019    CDC
    GCA_024133825.1 SAMN29474253    Health Protection Agency
    GCA_024133685.1 SAMN29473230    CDC
    GCA_024133585.1 SAMN29474221    CDC
    GCA_024133405.1 SAMN29474277    CDC
    GCA_024133385.1 SAMN29473022    CDC

You can run more elaborate queries on the BioSample accessions in the second column if you can't get what you need from the submitter names in the third column.
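As a rough illustration of parsing the third column, the tab-separated xtract output can be filtered with awk. This is a sketch only: `filter_by_submitter` is a hypothetical helper, and the submitting organization is just a proxy for where a sample came from, not its recorded location.

```shell
# Hypothetical helper: print accessions (column 1) whose submitter
# (column 3) matches a case-insensitive regular expression.
# $1: pattern; input: tab-separated rows on stdin
filter_by_submitter() {
    awk -F '\t' -v pat="$1" 'tolower($3) ~ tolower(pat) { print $1 }'
}

# Usage: append "| filter_by_submitter CDC" to the esearch/esummary/xtract
# pipeline above (without the trailing head -10).
```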


