Automated way to pull representative/RefSeq genome using full species name?
2
0
9 weeks ago
predeus ★ 1.8k

Hi all,

I was wondering if there's a streamlined way of getting the "representative" genome for a particular species. What I'm looking for is an automated way of retrieving a genome assembly and annotation for all bacterial species in, e.g. Kraken2 output. The most popular databases for Kraken2 are made from RefSeq, so I'd imagine there should be a relatively easy way to match Taxonomy ID/species name to a RefSeq ID.

Any advice would be appreciated, as always!

All the best

-- Alex

ncbi taxonomy refseq • 715 views
ADD COMMENT
1
8 weeks ago
MirianT_NCBI ▴ 390

Hi Alex,
You can use NCBI Datasets. The datasets command-line tool lets you download genomes and their associated annotation files as a data package. In this case, since all bacterial genomes make for a pretty large download, we recommend fetching the data as a "dehydrated" package. You can read more about it in the how-to guide "Download large genome data packages".

For all bacteria, the steps would be:

  1. Download a dehydrated data package including genome and protein sequences. Here you need to decide if you want reference + representative genomes (--reference flag) or RefSeq only (--assembly-source refseq). I'm using the --reference flag to demonstrate:

    datasets download genome taxon bacteria --reference --include genome,protein --dehydrated --filename bacteria_reference.zip

  2. Unzip the data package:

    unzip bacteria_reference.zip -d bacteria_ref
    Archive:  bacteria_reference.zip
    inflating: bacteria_ref/README.md
    inflating: bacteria_ref/ncbi_dataset/data/assembly_data_report.jsonl
    inflating: bacteria_ref/ncbi_dataset/fetch.txt
    inflating: bacteria_ref/ncbi_dataset/data/dataset_catalog.json

  3. Rehydrate the data package. The file fetch.txt is used to actually retrieve all the data. You don't have to download everything at once if you don't need to (for example, use the flag --match genomic.fna to download only the genomes, or --match protein to download only the protein files). You can also download the files gzip-compressed with the flag --gzip. Here, I'm going to assume you want to rehydrate everything as gzip files:

    datasets rehydrate --directory bacteria_ref --gzip

Please feel free to reach out if you have any other questions.
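
Before rehydrating, it can be handy to check how many files a given --match pattern would pull in. Here is a minimal sketch, assuming fetch.txt lists one file per line (download URL, size, destination path). The helper name count_fetch_entries is made up for illustration and is not part of datasets:

```shell
# Hypothetical helper: count fetch.txt entries whose line matches a pattern.
# Assumes one file per line in fetch.txt (URL, size, destination path).
count_fetch_entries() {
    local fetch_file=$1
    local pattern=${2:-.}   # default pattern matches every non-empty line
    # grep -c prints the number of matching lines; it exits non-zero when the
    # count is 0, so "|| true" keeps the helper usable under "set -e".
    grep -c -- "$pattern" "$fetch_file" || true
}
```

For example, `count_fetch_entries bacteria_ref/ncbi_dataset/fetch.txt genomic.fna` should give a rough idea of how many genome FASTA files a rehydrate with --match genomic.fna would fetch.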

ADD COMMENT
0

Hi Mirian,

Thank you for your comment! I wanted to maybe avoid downloading all the bacterial species, and only download the assemblies and annotations of species/strains of interest. What is the best way to do this?

Thank you!

ADD REPLY
1

Download using accession numbers: see the thread "downloading genomes in fasta format from accession ids".

ADD REPLY
1

Hi Alex,
I understand. With datasets, you have two options:

  1. Download by accession: I'm not super familiar with Kraken2 output, but if you can get a list of GCA/GCF accessions from it, you can use that list as input to download the genomes of interest, like this:

    datasets download genome accession --inputfile kraken2.txt --reference --include genome,protein --filename kraken2_genomes.zip

  2. Download by taxon name or taxon ID: currently, datasets does not have an option to download by a list of taxa, but you can work around that with a loop. Assuming you have a list of taxids:

    cat taxids.txt
    1505597
    1505596
    291272
    2838947
    386585
    1158459
    624
    590

You can loop over that list and download each representative as a separate data package:

    while read -r TAXID; do echo "$TAXID"; datasets download genome taxon "$TAXID" --include genome,protein --reference --filename "${TAXID}_ref.zip"; done < taxids.txt

    1505597
    Collecting 1  records [================================================] 100% 1/1
    Downloading: 1505597_ref.zip    350kB done
    1505596
    Collecting 1  records [================================================] 100% 1/1
    Downloading: 1505596_ref.zip    350kB done
    291272
    Collecting 1  records [================================================] 100% 1/1
    Downloading: 291272_ref.zip    369kB done
    2838947
    Collecting 1  records [================================================] 100% 1/1
    Downloading: 2838947_ref.zip    2.3MB done
    386585
    Collecting 1  records [================================================] 100% 1/1
    Downloading: 386585_ref.zip    2.72MB done
    1158459
    Collecting 1  records [================================================] 100% 1/1
    Downloading: 1158459_ref.zip    2.41MB done
    624
    Collecting 1  records [================================================] 100% 1/1
    Downloading: 624_ref.zip    2.34MB done
    590
    Collecting 2  records [================================================] 100% 2/2
    Downloading: 590_ref.zip    4.76MB done

I hope this helps. Let me know if you have any other questions. :)
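
For building taxids.txt in the first place, one possibility is to pull the species-level taxids out of a Kraken2 report. This is only a sketch: it assumes the standard six-column, tab-separated report written by `kraken2 --report`, and the helper name kraken2_species_taxids and the read-count cutoff are made up for illustration:

```shell
# Hypothetical helper: print species-level taxids from a Kraken2 report.
# Assumes the standard 6-column, tab-separated report from `kraken2 --report`:
#   1 percent, 2 clade reads, 3 direct reads, 4 rank code, 5 taxid, 6 name
kraken2_species_taxids() {
    local report=$1
    local min_reads=${2:-1}   # minimum clade-level read count to keep a species
    awk -F'\t' -v min="$min_reads" '
        $4 == "S" && $2 + 0 >= min + 0 {
            gsub(/[[:space:]]/, "", $5)   # drop any padding around the taxid
            print $5
        }
    ' "$report"
}
```

Something like `kraken2_species_taxids sample.k2report 100 > taxids.txt` (sample.k2report is a placeholder file name) would then keep only species with at least 100 reads in their clade. Sub-species rank codes such as S1 are skipped on purpose, since the loop expects one taxid per species.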

ADD REPLY
0
9 weeks ago
GenoMax 125k

There are 15507 assemblies that represent 236000 prokaryotic RefSeq genome collection as of early 2022. A larger collection including archaea is also available.

You could simply download this collection.


If you want to get individual genomes, there are past threads: "more elegant way to bulk download genomes from the NCBI" and "How to download all Pseudomonas aeruginosa Genomes from NCBI Genomes database?"

https://github.com/kblin/ncbi-genome-download or https://github.com/pirovc/genome_updater are popular.

NCBI datasets is another option if you want a command-line tool.


ADD COMMENT
1

I think the idea is to pull "the representative" genome. Other than maybe PAO1 for Pseudomonas aeruginosa, I can't think of "the" representative strain for a bacterial species...

ADD REPLY
1

Every bacterial species will likely have one strain that is used more often, e.g. Escherichia coli str. K-12 substr. MG1655, along with the others that are in the collection above.

ADD REPLY
0

I think this question is asking for both (1) a list of accessions giving the "one strain that is used more often" for each species and (2) a method of downloading the associated genomes. So far I have only seen answers for (2); but I would be interested in the answer to (1) for my own edification.

ADD REPLY
0

If you visit the list I linked above, you can download the accession numbers of the assemblies that NCBI has put together in the RefSeq collection. Use the drop-down to change from Summary to ID table, then download the list by sending it to a file. NCBI's selection of a strain/genome for each organism is likely human-curated but may not be perfect.

list

ADD REPLY
0

Thanks for clarifying. I interpreted "There are 15507 assemblies that represent 236000 prokaryotic RefSeq genome collection as of early 2022" as meaning you could pull out 15507 assemblies for unique species; not that there was already an ontology term "representative_genome" (and "reference_genome") indicating that manual/automated curation had been performed. That was the missing link.
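
For anyone who wants to go from a species name to that curated pick on the command line, here is a rough sketch. It relies on something not mentioned in this thread, so treat it as an assumption: the RefSeq assembly_summary.txt file (e.g. https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt), which I believe is tab-separated with the assembly accession in column 1, refseq_category in column 5, species taxid in column 7 and organism name in column 8. The helper name representative_for_species is made up:

```shell
# Hypothetical helper: list reference/representative assemblies for a species name.
# Assumes the tab-separated RefSeq assembly_summary.txt layout:
#   col 1 assembly accession, col 5 refseq_category,
#   col 7 species taxid,      col 8 organism name
representative_for_species() {
    local species=$1
    local summary=$2
    awk -F'\t' -v sp="$species" '
        /^#/ { next }   # skip the header lines
        ($5 == "reference genome" || $5 == "representative genome") &&
        index($8, sp) == 1 {   # organism name starts with the species name
            print $1 "\t" $7 "\t" $8
        }
    ' "$summary"
}
```

After downloading the summary file, `representative_for_species "Escherichia coli" assembly_summary.txt` would print the accession, species taxid and organism name of any matching reference or representative assembly. The match is a plain prefix match on the organism name, so it is worth eyeballing the hits before passing the accessions to a download tool.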

ADD REPLY
