Question: Database Of Host And Pathogen Pairs
7.0 years ago by
United States
Will4.5k wrote:

I'm looking to do a project on predicting which bacteria colonize a particular host and on identifying the genomic features that determine these interactions.

Does anyone know of a good database that annotates known host-pathogen interactions? I know I could pull the cross-organism interactions from a PPI database like BIND, but that only gives a handful of examples and seems overly restrictive.

7.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum116k wrote:

See how @rdmpage built a database of host-pathogen pairs using GenBank:

Back in 2006 in a short post entitled "Building the encyclopedia of life" I wrote that GenBank is a potentially rich source of information on host-parasite relationships. Often sequences of parasites will include information on the name of the host (the example I used was sequence AF131710 from the platyhelminth Ligophorus mugilinus, which records the host as the Flathead mullet Mugil cephalus).

Edit: another one: pathogen-host interactions database.

This database contains expertly curated molecular and biological information on genes proven to affect the outcome of pathogen-host interactions. Information is also given on the target sites of some anti-infective chemistries.

I've been meaning to take this project beyond the blog post stage. If there's enough interest I could look at creating a web site and services around the host-parasite data in GenBank.

Big (depending on taxonomic scope). I built the visualisations from a subset of GenBank (mainly eukaryote non-EST sequences). If you ask GenBank today how many sequences have the "host" field, it returns 3,466,914.

excellent ... wouldn't have thought to look in GenBank!

@roderic: I've already got a quick-and-dirty Python script to extract the data. Do you remember roughly how many associations you found? I'm just trying to get an idea of how large this symbiome is going to be.

Just discovered that the search in my previous comment matches "host" anywhere in the sequence record, so it also returns sequences that have no "host" field but mention host in, say, the title of the article that published the sequence. That figure is therefore an overestimate of the number of records with a host.

yeah, parsing through now ... looks to be ~1.7 million triples (genbank-record, host, symbiote)
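For anyone who wants to try the same extraction, here is a minimal sketch of pulling (genbank-record, host, symbiote) triples out of GenBank flat-file text. It is not the script mentioned above. It only counts a record when the `/host` qualifier is actually present in the feature table, which avoids the free-text overcounting described in the earlier comment. The function name `host_triples` and the `SAMPLE` text are made up for illustration; the sample is a hand-written fragment, not a real download.

```python
import re
from typing import Iterator, Tuple

# Qualifier values are quoted and may wrap across feature-table lines.
HOST_RE = re.compile(r'/host="([^"]*)"')
ORGANISM_RE = re.compile(r'/organism="([^"]*)"')
ACCESSION_RE = re.compile(r"^ACCESSION\s+(\S+)", re.MULTILINE)


def host_triples(flatfile_text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (accession, host, symbiote) for records with an explicit /host qualifier."""
    # Records in a GenBank flat file are terminated by a line containing "//".
    for record in flatfile_text.split("\n//"):
        acc = ACCESSION_RE.search(record)
        host = HOST_RE.search(record)
        organism = ORGANISM_RE.search(record)
        if acc and host and organism:
            # Collapse any line wraps inside the qualifier values.
            yield (
                acc.group(1),
                " ".join(host.group(1).split()),
                " ".join(organism.group(1).split()),
            )


# Hand-written illustrative fragment (hypothetical layout, trimmed to the
# fields the parser reads). The second record mentions "host" only in free text.
SAMPLE = """LOCUS       AF131710
DEFINITION  Illustrative record.
ACCESSION   AF131710
FEATURES             Location/Qualifiers
     source          1..100
                     /organism="Ligophorus mugilinus"
                     /host="Mugil cephalus"
//
LOCUS       XX000001
DEFINITION  Illustrative record that mentions a host only in this line.
ACCESSION   XX000001
FEATURES             Location/Qualifiers
     source          1..50
                     /organism="Example organism"
//
"""

if __name__ == "__main__":
    for triple in host_triples(SAMPLE):
        print(triple)
```

The second record is skipped because it has no `/host` qualifier, even though the word "host" appears in its definition line.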

7.0 years ago by
Hamish3.1k wrote:

The pathogen-specific resources might be a useful starting point. For example:

If you are including viruses then UniProtKB may be a useful source, since it details organism/host for viruses. Sadly they don't seem to have included other organism/host relationships.
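As a rough sketch of how the virus/host pairs could be pulled from a UniProtKB flat-file download: entries carry the organism on `OS` lines and, for viruses, hosts on `OH` lines of the form `OH   NCBI_TaxID=9606; Homo sapiens (Human).`. The function name `virus_hosts` and the sample entry (including its accession and virus name) are placeholders, not real UniProt content.

```python
import re
from typing import Dict, List, Tuple

# One host per OH line: taxonomy ID, then the host name, then a final period.
OH_RE = re.compile(r"^OH\s+NCBI_TaxID=(\d+);\s*(.*?)\.?\s*$")


def virus_hosts(flat_text: str) -> Dict[str, List[Tuple[int, str]]]:
    """Map each organism (OS lines) to its list of (taxid, host name) from OH lines."""
    hosts: Dict[str, List[Tuple[int, str]]] = {}
    # Entries in the flat file are terminated by a line containing "//".
    for entry in flat_text.split("\n//"):
        os_parts: List[str] = []
        oh_hits: List[Tuple[int, str]] = []
        for line in entry.splitlines():
            if line.startswith("OS   "):
                os_parts.append(line[5:].strip())
            else:
                match = OH_RE.match(line)
                if match:
                    oh_hits.append((int(match.group(1)), match.group(2)))
        if os_parts and oh_hits:
            organism = " ".join(os_parts).rstrip(".")
            hosts.setdefault(organism, []).extend(oh_hits)
    return hosts


# Placeholder entry for illustration only (not a real UniProtKB record).
UNIPROT_SAMPLE = """ID   EXAMPLE_VIRUS           Reviewed;         100 AA.
AC   X00000;
OS   Example virus (strain hypothetical).
OH   NCBI_TaxID=9606; Homo sapiens (Human).
OH   NCBI_TaxID=9913; Bos taurus (Bovine).
//
"""

if __name__ == "__main__":
    for organism, host_list in virus_hosts(UNIPROT_SAMPLE).items():
        print(organism, host_list)
```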

Other possible sources that come to mind are:

  • metabolic pathway databases. For example KEGG PATHWAY has a set of pathways related to infectious disease that could be useful.
  • microarray expression databases. For example ArrayExpress contains details of experiments looking for changes in gene expression related to various disease states (try a search for terms like 'infected').

While I suspect that these will suffer from similar limitations to the PPI data, they are worth looking at. Additional pairings from the analysis of the INSDC databases, as suggested by Roderic, will extend coverage, as will text-mining of the literature.

7.0 years ago by
Blacksburg, VA USA
Behindtherabbit60 wrote:

there are over 9000 experimentally confirmed host-pathogen PPIs from bacteria available from public PPI databases. if you include viruses as well, that number is much higher. several public (and published) resources cull unique HP-PPI pairs from the broader databases, including PATRIC (our group), HPIdb, PHI-Base, and more. also check out the PSICQUIC Web interface at EBI which lets you query many public dbs yourself. hope that helps!

I don't really need the PPI level info, I just need the organism interaction level. When I looked through BIND and NCBI's repository I only found ~500 unique host-pathogen associations.
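If the PPI records come as PSI-MI TAB (MITAB 2.5) rows, for example from a PSICQUIC query, collapsing them to organism-level pairs might look like the sketch below. It assumes the standard MITAB 2.5 layout, where columns 10 and 11 hold the taxonomy IDs of interactors A and B as `taxid:NNNN(name)`. The helper names `organism_pairs` and `_row` and the sample rows are invented for illustration.

```python
import re
from typing import Iterable, List, Set, Tuple

TAXID_RE = re.compile(r"taxid:(-?\d+)")


def organism_pairs(mitab_lines: Iterable[str]) -> Set[Tuple[int, int]]:
    """Collapse MITAB rows to unique cross-species (taxid, taxid) pairs."""
    pairs: Set[Tuple[int, int]] = set()
    for line in mitab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 11:
            continue  # not a full MITAB 2.5 row
        tax_a = TAXID_RE.search(cols[9])   # column 10: taxid of interactor A
        tax_b = TAXID_RE.search(cols[10])  # column 11: taxid of interactor B
        if not (tax_a and tax_b):
            continue
        a, b = int(tax_a.group(1)), int(tax_b.group(1))
        if a != b:
            # Keep only between-organism interactions, stored in sorted order
            # so that A-B and B-A count as the same pair.
            pairs.add((min(a, b), max(a, b)))
    return pairs


def _row(tax_a: str, tax_b: str) -> str:
    """Build a 15-column MITAB row with placeholder values except the taxids."""
    cols: List[str] = ["-"] * 15
    cols[9], cols[10] = tax_a, tax_b
    return "\t".join(cols)


# Illustrative rows only: two cross-species interactions and one within-species.
SAMPLE_ROWS = [
    _row("taxid:9606(Homo sapiens)", "taxid:632(Yersinia pestis)"),
    _row("taxid:632(Yersinia pestis)", "taxid:9606(Homo sapiens)"),
    _row("taxid:9606(Homo sapiens)", "taxid:9606(Homo sapiens)"),
]

if __name__ == "__main__":
    print(organism_pairs(SAMPLE_ROWS))
```

Note that this gives pairs of taxa, not a host/pathogen direction; deciding which side is the host would need extra information such as taxonomy lineage.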

7.0 years ago by
Athens, GA, USA
Casey Bergman17k wrote:

You could also try EnvDB, a "database that aims to provide the most complete census to-date of the environmental distribution of prokaryotes". Under "environments", select "host associated".

