Restricting ncbi-genome-download (bacteria, viral, fungi, archaea, protozoa) to a specific host
7 hours ago
saruman ▴ 10

Hello everyone,

I’m looking to download all complete bacterial, viral, fungal, archaeal, and protozoal genomes from NCBI using ncbi-genome-download:

ncbi-genome-download --formats fasta,assembly-report --parallel 20 --progress-bar --section refseq --flat-output --assembly-levels complete bacteria,viral,fungi,archaea,protozoa

However, I need to restrict these genomes to a specific host (human, in my case). As far as I know, ncbi-genome-download does not offer a direct option to specify a host, so I was wondering if there is a fast or efficient workaround.

Has anyone faced this issue before or found a practical solution?

Thank you in advance for your help!

ncbi • 750 views
5 hours ago
GenoMax 153k

There may be a better answer, but let us start here. This is a solution in two (or more) steps using NCBI datasets (LINK).

NOTE: I am restricting the following solution to RefSeq genomes, since they are likely to be of the best quality. If you just want complete genomes, you could use GenBank instead.

Step 1: You can grab the accession numbers (GCA* or GCF* numbers) using datasets. You can download the table with accessions for bacteria here: https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=2

Step 2: Filter the accessions to keep only those where the "host" is listed as human.

$ datasets summary genome accession GCF_001281985.1 --assembly-source refseq --as-json-lines | dataformat tsv genome --fields accession,assminfo-name,assminfo-biosample-host
Assembly Accession      Assembly Name   Assembly BioSample Host 
GCF_001281985.1 ASM128198v1     Homo sapiens

A more generic example:

$ datasets summary genome taxon "Escherichia" --assembly-source refseq --as-json-lines | dataformat tsv genome --fields accession,assminfo-name,assminfo-biosample-host | awk 'NR==1 || /Homo sapiens/'
Assembly Accession      Assembly Name   Assembly BioSample Host    
GCF_001286085.1 7790_1#78       Homo sapiens
GCF_002563295.1 ASM256329v1     Homo sapiens
GCF_002895205.1 ASM289520v1     Homo sapiens
GCF_002965635.1 ASM296563v1     Homo sapiens
GCF_002965665.1 ASM296566v1     Homo sapiens
GCF_002965685.1 ASM296568v1     Homo sapiens
GCF_002965725.1 ASM296572v1     Homo sapiens
GCF_002965745.1 ASM296574v1     Homo sapiens
GCF_003569025.1 ASM356902v1     Homo sapiens
GCF_004322685.1 ASM432268v1     Homo sapiens
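The filter in step 2 can also be written to match the host column only and to emit just the accessions, which is the form step 3 needs. The following is a minimal sketch, assuming the three-column tab-separated output shown above (accession, assembly name, BioSample host); the function name `filter_human_hosts` and the output file name are made up for illustration. BioSample host is a free-text field, so values such as "human" would not be caught by this match.

```shell
# Print the accession of every row whose host column is "Homo sapiens".
# Assumes tab-separated input with one header row, the accession in
# column 1 and the BioSample host in column 3.
filter_human_hosts() {
    awk -F'\t' 'NR > 1 {
        host = $3
        # Trim stray spaces or a trailing carriage return before comparing.
        gsub(/^[ \t]+|[ \t\r]+$/, "", host)
        if (tolower(host) == "homo sapiens") print $1
    }'
}

# Hypothetical usage (needs network access and the NCBI datasets CLI):
#   datasets summary genome taxon "Escherichia" --assembly-source refseq --as-json-lines \
#     | dataformat tsv genome --fields accession,assminfo-name,assminfo-biosample-host \
#     | filter_human_hosts > human_host_accessions.txt
```

Matching on the host column alone avoids accidental hits in the assembly name, which a plain grep over the whole line cannot rule out.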

Step 3: Download the genomes for the accessions collected in step 2, using either datasets or ncbi-genome-download (if it accepts a file of accessions).
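Step 3 could look roughly like the sketch below. It assumes a text file with one accession per line (the file names and function names are placeholders) and that the installed versions of both tools support the options shown: `--inputfile` for datasets, and `--assembly-accessions`, which ncbi-genome-download documents as taking either a comma-separated list or a file path. Check `--help` on your installation before relying on either.

```shell
# Download genome FASTA for every accession listed in a file (one per line)
# with the NCBI datasets CLI. $1 = accession file, $2 = output zip (optional).
download_with_datasets() {
    datasets download genome accession --inputfile "$1" \
        --include genome --filename "${2:-human_host_genomes.zip}"
}

# Alternative sketch with ncbi-genome-download, reusing the groups and
# formats from the original question. $1 = accession file.
download_with_ngd() {
    ncbi-genome-download --section refseq --formats fasta,assembly-report \
        --flat-output --assembly-accessions "$1" \
        bacteria,viral,fungi,archaea,protozoa
}

# Hypothetical usage:
#   download_with_datasets human_host_accessions.txt
```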


It is astonishing that there is no straightforward way to match accessions and their host. Anyhow, I will use datasets as suggested, though on a much larger list of accession numbers (https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=1) compared to step 1, which includes more than 3 million records. Thank you for your help.
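For a list that large, one option is to query one taxonomic group at a time rather than the whole tree and append the matching accessions as you go. The sketch below is only an outline under several assumptions: the function name is invented, the taxon names you pass must be ones that datasets accepts (there is no single NCBI taxon equivalent to the RefSeq "protozoa" group, so that group would need its own list of taxa), and `--assembly-level complete` is used to mirror the `--assembly-levels complete` filter from the question.

```shell
# For each taxon given as an argument, print the RefSeq accessions of
# complete assemblies whose BioSample host is "Homo sapiens".
list_human_host_accessions() {
    for taxon in "$@"; do
        datasets summary genome taxon "$taxon" \
            --assembly-source refseq --assembly-level complete --as-json-lines \
          | dataformat tsv genome --fields accession,assminfo-biosample-host \
          | awk -F'\t' 'NR > 1 && tolower($2) == "homo sapiens" { print $1 }'
    done
}

# Hypothetical usage (needs network access; may take a long time):
#   list_human_host_accessions bacteria archaea viruses fungi > human_host_accessions.txt
```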

5 hours ago
Mensur Dlakic ★ 30k

As you said, the package does not support the search for hosts. Furthermore, I don't think there are any host links in NCBI records, so it is very doubtful that this software can access that type of information.

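I think that leaves it to you to find a list of human pathogens independently, then pass their taxonomic IDs (the pathogens' IDs, not the host's) to the tool. For example, using the species taxids for Escherichia coli (562) and Staphylococcus aureus (1280):

ncbi-genome-download --species-taxids 562,1280 bacteria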

There are databases of pathogens and their hosts:

For example, here is a search of the PHI-BASE for human as a host:

http://www.phi-base.org/searchFacet.htm?queryTerm=homo+sapiens+%28related%3A+human%29

Creating a non-redundant list of the third column in that output will give you a list of pathogens that infect humans.
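As a rough sketch of that last step, assuming the search results have been saved as a tab-separated file with one header row and the pathogen name in the third column (the export format and the function name here are assumptions, not something PHI-base is confirmed to provide in this form):

```shell
# Print the unique values of column 3 from a tab-separated file,
# skipping the header row. $1 = path to the exported table.
unique_pathogens() {
    tail -n +2 "$1" | cut -f3 | sort -u
}

# Hypothetical usage:
#   unique_pathogens phi_base_human.tsv > human_pathogens.txt
```

The resulting names would still need to be mapped to NCBI taxonomy IDs before they can be passed to ncbi-genome-download.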


"I don't think there are any host links in NCBI records"

This information is not available for all genomes, but it is there for some of the records. That is how the datasets answer above is able to find the "BioSample host". Those genomes come from organisms whose source was listed as human.

Since OP is looking for genomes where the host is explicitly listed, searching by a generic taxon name alone may pull in genomes that do not satisfy the human-host requirement. For example, without the filter for "sapiens", the same query returns accessions whose host is non-human or not recorded at all.

$ datasets summary genome taxon "Escherichia" --assembly-source refseq --as-json-lines | dataformat tsv genome --fields accession,assminfo-name,assminfo-biosample-host | head -10
Assembly Accession      Assembly Name   Assembly BioSample Host 
GCF_028622335.1 ASM2862233v1    Columba livia
GCF_001286085.1 7790_1#78       Homo sapiens
GCF_001514555.1 ASM151455v1
GCF_001514575.1 ASM151457v1
GCF_001514595.1 ASM151459v1
GCF_001514625.1 ASM151462v1
GCF_001514645.1 ASM151464v1
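To see how much of a result set actually carries host information, the host column can be tabulated. This is a minimal sketch, assuming the same three-column tab-separated output as above; the function name and the "(not recorded)" label are made up for illustration.

```shell
# Count how often each host value occurs, most frequent first.
# Assumes tab-separated input with one header row and the host in column 3;
# rows with an empty or missing host are counted as "(not recorded)".
tabulate_hosts() {
    awk -F'\t' 'NR > 1 {
        host = ($3 == "" ? "(not recorded)" : $3)
        count[host]++
    }
    END { for (h in count) printf "%d\t%s\n", count[h], h }' | sort -k1,1nr
}

# Hypothetical usage (needs network access and the NCBI datasets CLI):
#   datasets summary genome taxon "Escherichia" --assembly-source refseq --as-json-lines \
#     | dataformat tsv genome --fields accession,assminfo-name,assminfo-biosample-host \
#     | tabulate_hosts | head
```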

Always good to learn new things.
