Automated way to pull representative/RefSeq genome using full species name?
2
0
9 weeks ago
predeus ★ 1.8k

Hi all,

I was wondering if there's a streamlined way of getting the "representative" genome for a particular species. What I'm looking for is an automated way of retrieving a genome assembly and annotation for every bacterial species in, e.g., a Kraken2 output. The most popular Kraken2 databases are built from RefSeq, so I'd imagine there should be a relatively easy way to match a Taxonomy ID or species name to a RefSeq ID.

Any advice would be appreciated, as always!

All the best

-- Alex

ncbi taxonomy refseq • 709 views
1
8 weeks ago
MirianT_NCBI ▴ 390

Hi Alex,
You can use NCBI Datasets. The datasets command-line tool lets you download genomes and their associated annotation files as a data package. In this case, since it's a pretty large download (all bacterial genomes), we recommend downloading the data as a "dehydrated" package. You can read more about it in this How-to guide: Download large genome data packages.

For all bacteria, the steps would be:

1. Download a dehydrated data package including genome and protein sequences. Here you need to decide whether you want reference and representative genomes (--reference flag) or RefSeq assemblies only (--assembly-source refseq). I'm using the --reference flag to demonstrate.
datasets download genome taxon bacteria --reference --include genome,protein --dehydrated --filename bacteria_reference.zip

2. Unzip the data package:

unzip bacteria_reference.zip -d bacteria_ref
Archive:  bacteria_reference.zip
inflating: bacteria_ref/ncbi_dataset/data/assembly_data_report.jsonl
inflating: bacteria_ref/ncbi_dataset/fetch.txt
inflating: bacteria_ref/ncbi_dataset/data/dataset_catalog.json

3. Rehydrate the data package. The file fetch.txt is used to retrieve the actual data. You don't have to download everything at once if you don't need to: for example, use the flag --match genomic.fna to download only the genomes, or --match protein to download only the protein files. You can also download the files gzip-compressed with the flag --gzip. Here, I'm going to assume you want to rehydrate everything as gzip files:

datasets rehydrate --directory bacteria_ref --gzip
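
Before rehydrating, it can be useful to check which assemblies the dehydrated package covers. The following is only a sketch: it assumes each line of assembly_data_report.jsonl is one JSON record whose "accession" value starts with GCA_ or GCF_, and the function name list_accessions is made up for illustration (it is not part of datasets).

```shell
# List unique assembly accessions found in a JSONL data report.
# Assumption: one JSON record per line, with an "accession" key whose
# value looks like GCA_<digits>.<version> or GCF_<digits>.<version>.
# Other "accession" keys (BioSample, BioProject) are ignored by the pattern.
list_accessions() {
    grep -o '"accession": *"GC[AF]_[0-9]*\.[0-9]*"' "$1" \
        | cut -d'"' -f4 \
        | sort -u
}
```

For example, list_accessions bacteria_ref/ncbi_dataset/data/assembly_data_report.jsonl | wc -l would give the number of assemblies that rehydration is expected to fetch.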


Please feel free to reach out if you have any other questions.

0

Hi Mirian,

Thank you for your comment! I wanted to maybe avoid downloading all the bacterial species, and only download the assemblies and annotations of species/strains of interest. What is the best way to do this?

Thank you!

1

Hi Alex,
I understand. With datasets, you have two options:

1. Download by accession: I'm not very familiar with the Kraken2 output, but if you can get a list of GCA/GCF accessions from it, you can use that list as input to download the genomes of interest. Like this:
datasets download genome accession --inputfile kraken2.txt --reference --include genome,protein --filename kraken2_genomes.zip

2. Download by taxon name or taxid: currently, datasets does not have an option to download by a list of taxa, but you can work around that with a loop. Assuming you have a list of taxids:
cat taxids.txt
1505597
1505596
291272
2838947
386585
1158459
624
590


You can loop over that list and download each representative as a separate data package.

while read -r TAXID; do echo "$TAXID"; datasets download genome taxon "$TAXID" --include genome,protein --reference --filename "${TAXID}_ref.zip"; done < taxids.txt

1505597
Collecting 1  records [================================================] 100% 1/1
1505596
Collecting 1  records [================================================] 100% 1/1
291272
Collecting 1  records [================================================] 100% 1/1
2838947
Collecting 1  records [================================================] 100% 1/1
386585
Collecting 1  records [================================================] 100% 1/1
1158459
Collecting 1  records [================================================] 100% 1/1
624
Collecting 1  records [================================================] 100% 1/1
590
Collecting 2  records [================================================] 100% 2/2
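
If you start from a Kraken2 report rather than a ready-made taxids.txt, the list could be built along these lines. This is a sketch under assumptions: it expects the standard tab-separated Kraken2 report (percent, clade reads, direct reads, rank code, taxid, name), and the function names species_taxids and print_download_cmds are hypothetical helpers, not part of Kraken2 or datasets. The second function only prints the download commands so they can be reviewed before anything is fetched.

```shell
# Print the NCBI taxids of species-level rows (rank code "S") in a
# Kraken2 report. $1 = report file, $2 = minimum clade reads (default 1).
species_taxids() {
    awk -F'\t' -v min="${2:-1}" \
        '$4 == "S" && $2 + 0 >= min + 0 { gsub(/ /, "", $5); print $5 }' "$1"
}

# Read taxids on stdin and print one download command per taxid (dry run).
# Blank and non-numeric lines are skipped.
print_download_cmds() {
    while read -r taxid; do
        case "$taxid" in
            ''|*[!0-9]*) continue ;;
        esac
        printf 'datasets download genome taxon %s --reference --include genome,protein --filename %s_ref.zip\n' \
            "$taxid" "$taxid"
    done
}
```

With a report called sample.kreport (a placeholder name), species_taxids sample.kreport 100 > taxids.txt keeps species with at least 100 clade-level reads, and print_download_cmds < taxids.txt shows the commands; pipe that output to sh once it looks right.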


I hope this helps. Let me know if you have any other questions. :)

0
9 weeks ago
GenoMax 125k

There are 15507 assemblies that represent 236000 prokaryotic RefSeq genome collection as of early 2022. A larger collection including archaea is also available.

If you want to get individual genomes, there are past threads: "more elegant way to bulk download genomes from the NCBI" and "How to download all Pseudomonas aeruginosa Genomes from NCBI Genomes database?"

NCBI Datasets is another option if you want a command-line tool.

1

I think the idea is to pull "the representative" genome. Other than maybe PAO1 for Pseudomonas aeruginosa, I can't think of "the" representative strain for a bacterial species...

1

Every bacterial species will likely have one strain that is used more often than the others, e.g. Escherichia coli str. K-12 substr. MG1655 and the other strains in the collection above.

0

I think this question is asking for both (1) a list of accessions giving the "one strain that is used more often" for each species and (2) a method of downloading the associated genomes. So far I have only seen answers for (2); but I would be interested in the answer to (1) for my own edification.

0

If you visit the list I linked above, you can download the accession numbers of the assemblies that NCBI has put together in the RefSeq collection. Use the drop-down menu to switch from Summary to ID table, then download the list by sending it to a file. NCBI's selection of a strain/genome for each organism is likely human-curated but may not be perfect.

0

Thanks for clarifying. I interpreted "There are 15507 assemblies that represent 236000 prokaryotic RefSeq genome collection as of early 2022" as meaning you could pull out 15507 assemblies for unique species; not that there was already an ontology term "representative_genome" (and "reference_genome") indicating that manual/automated curation had been performed. That was the missing link.