Bio.Entrez - Get virus host name using NCBI Taxonomy db
7.0 years ago
beegrackle ▴ 90

I'm trying to create a FASTA file containing all the viral sequences for a particular gene, with taxonomy information in each record description. So far so good, except for the host. I can see general host information on the taxonomy page of each virus (for example, http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1221449 has "Host: plants" as part of its entry), but that information is not in the record I get back when I run an efetch query against the taxonomy database with the taxonomy ID. And I really want that host information! It's right there, taunting me. If anyone knows how to get at it, I'd really appreciate it.

Here is my query, in case it matters:

handle2 = Entrez.efetch(db="Taxonomy", id=taxid, retmode="xml")
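
For reference, a minimal sketch of the full round trip, assuming Biopython is installed and you pass your own contact address. The helper names `fetch_taxonomy` and `describe` are made up for illustration; the dictionary keys are the ones Biopython's parser exposes for Taxonomy records. As the answers below point out, none of the returned fields carries the host.

```python
def fetch_taxonomy(taxid, email):
    """Fetch one Taxonomy record as a dict (needs network access and Biopython)."""
    # Imported here so describe() below can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.efetch(db="Taxonomy", id=str(taxid), retmode="xml")
    try:
        records = Entrez.read(handle)
    finally:
        handle.close()
    return records[0]


def describe(record):
    """Build a FASTA-style description from a parsed Taxonomy record."""
    return "taxid={TaxId} {ScientificName} [{Lineage}]".format(**record)
```

Usage would look like `print(describe(fetch_taxonomy(1221449, "you@example.org")))`, where the address is a placeholder.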

Edit:

Based on what Neilfws wrote, I wrote some Python to scrape the NCBI taxonomy browser for the virus host name, since Ruby is Greek to me. Here it is for any other poor saps who need to do this. Depending on the tax UID (and, one presumes, how frisky a PI was feeling when they entered their sequence), the taxonomy browser sometimes takes you to a list of species links rather than the taxonomy entry, so this code accounts for that... usually.

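from bs4 import BeautifulSoup as BS
from urllib.request import urlopen  # Python 3; the original post used urllib2
from urllib.parse import urljoin
import re

# The page-fetching lines were lost from the post. The URL below is the one
# from the question and is an assumption about what the original code used.
BASE_URL = 'http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id='
HOST_RE = r'Host:\s</em>(.*?)<'

for tax_id in listoftaxids:  # listoftaxids: your own list of taxonomy UIDs
    page = urlopen(BASE_URL + str(tax_id))
    soup = BS(page, 'html.parser')
    find_string = soup.body.form.find_all('td')
    find = 0
    for i in find_string:
        for match in re.findall(HOST_RE, str(i)):
            print(match)
            find += 1
    if find == 0:
        # No host on this page: it may be a list of species links, so follow them.
        spec_link = soup.body.form.find_all('a', attrs={'title': 'species'})
        for link in spec_link:
            newpage = urlopen(urljoin(BASE_URL, link['href']))  # assumed reconstruction
            soup1 = BS(newpage, 'html.parser')
            find_string = soup1.body.form.find_all('td')
            for i in find_string:
                for match in re.findall(HOST_RE, str(i)):
                    print(match)
                    find += 1
    if find == 0:
        print('SERIOUSLY???')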
biopython entrez ncbi virus

7.0 years ago
Neilfws 49k

I'm pretty sure that Host is not returned in the XML of an Entrez query. You can get the same XML that efetch returns by visiting a URL like this one and selecting Send to -> File -> format -> XML, and that does not contain the host.

So all I can suggest is scraping the web page. Which is prone to failure of course, should the HTML change. Currently, there is a single table cell in which information, including the Host, is separated by line breaks. This does not make for easy parsing using e.g. XPath.

I came up with this (rough and ready, no error checks or tests) using Nokogiri for Ruby; I'm sure there's something similar in Python.

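#!/usr/bin/ruby

require 'nokogiri'
require 'open-uri'

# Print the "Host:" value from the NCBI taxonomy browser page for one UID.
def get_host(uid)
  url  = "http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&lvl=3&lin=f&keep=1&srchmode=1&unlock&id=" + uid.to_s
  # This line was missing from the post; fetching with open-uri is an assumed reconstruction.
  doc  = Nokogiri::HTML(URI.open(url))
  # All fields share one table cell, so split each cell on line breaks.
  data = doc.xpath("//td").collect { |x| x.inner_html.split("<br>") }.flatten
  data.each do |e|
    puts $1 if e =~ /Host:\s+<\/em>(.*?)$/
  end
end

get_host(ARGV[0])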

Save that as e.g. taxhost.rb, then supply the taxonomy UID as first argument to the script.

$ ruby taxhost.rb 12249
plants
$ ruby taxhost.rb 12721
vertebrates
$ ruby taxhost.rb 11709
vertebrates| human
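
For comparison, here is a hedged Python sketch of the same parsing step, kept separate from the download so it can be tried offline. It assumes the markup has the shape implied by the regex above (`Host: </em>` followed by the value, with fields separated by `<br>`); the function name `extract_hosts` and the sample fragment are illustrative, not copied from NCBI.

```python
import re

# Matches "Host: </em>value" up to the next tag or the end of the chunk.
HOST_RE = re.compile(r"Host:\s*</em>\s*([^<]*)")


def extract_hosts(html):
    """Return the host strings found in a taxonomy-browser page."""
    hosts = []
    # The fields share one table cell and are separated by <br> tags.
    for chunk in re.split(r"<br\s*/?>", html, flags=re.IGNORECASE):
        match = HOST_RE.search(chunk)
        if match and match.group(1).strip():
            hosts.append(match.group(1).strip())
    return hosts


# Illustrative fragment only; the real page contains much more markup.
sample = "<td><em>Rank: </em>species<br><em>Host: </em>vertebrates| human<br></td>"
print(extract_hosts(sample))
```

Like any scraper, this breaks as soon as the page layout changes, so an empty result should be treated as "unknown" rather than "no host".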


Thanks Neilfws, especially for the regex bit :)


Pierre suggested I run my code over all virus taxonomy UIDs and create a flat-file "database" so here it is (for 134 378 UIDs as of last week).
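
Such a flat file is easy to build and reuse. A minimal sketch, assuming a two-column tab-separated layout (taxonomy UID, host string); the layout and the function names are assumptions for illustration, not necessarily the format of the file mentioned above.

```python
import csv
import io


def write_hosts(pairs, handle):
    """Write (taxid, host) pairs as tab-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for taxid, host in pairs:
        writer.writerow([taxid, host])


def read_hosts(handle):
    """Load the flat file back into a {taxid: host} dictionary."""
    reader = csv.reader(handle, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


# Round trip through an in-memory buffer, using two UIDs from the runs above.
buffer = io.StringIO()
write_hosts([("12249", "plants"), ("12721", "vertebrates")], buffer)
buffer.seek(0)
hosts = read_hosts(buffer)
print(hosts["12249"])
```

With a real file, pass an object opened with `open(path, newline="")` instead of the `StringIO` buffer.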

6.8 years ago
me ▴ 740

You can get this information from the UniProt SPARQL endpoint.

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
SELECT ?taxon ?name ?host ?hostName
WHERE
{
?taxon a up:Taxon .
?taxon up:scientificName ?name .
?taxon up:host ?host .
?host up:scientificName ?hostName .
}
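
To run a query like this from Python, one option is a plain HTTP request. The sketch below assumes the public endpoint at https://sparql.uniprot.org/sparql and the standard SPARQL JSON results format; `run_query` and `bindings_to_rows` are hypothetical helper names. Adding a LIMIT clause while experimenting is advisable, since the query as written returns every taxon that has a host annotation.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://sparql.uniprot.org/sparql"  # assumed public endpoint


def run_query(query, endpoint=ENDPOINT):
    """Send a SPARQL SELECT query and return the decoded JSON results (needs network)."""
    url = endpoint + "?" + urlencode({"query": query})
    request = Request(url, headers={"Accept": "application/sparql-results+json"})
    with urlopen(request) as response:
        return json.load(response)


def bindings_to_rows(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]
```

Calling `bindings_to_rows(run_query(query))` with the query text above would give one dictionary per result row, keyed by `taxon`, `name`, `host` and `hostName`.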

Most of the original question can then be answered with a single query that returns FASTA-formatted sequences for viruses with a known host.

Assuming you are interested in genes named "POL":

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
SELECT ?taxon ?fasta
WHERE
{
?taxon a up:Taxon .
?taxon up:scientificName ?name .
?taxon up:host ?host .
?host up:scientificName ?hostName .
?protein up:organism ?taxon .
?protein up:encodedBy/skos:prefLabel "POL" .
?protein up:sequence ?sequence .
?sequence rdf:value ?realseq .
BIND(CONCAT(">",SUBSTR(STR(?protein), 33),"\n",?realseq,"\n") AS ?fasta)
}
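
The query returns each sequence on a single line. If line-wrapped FASTA with the host in the description is wanted, a small helper can do the formatting after the fact. This is only a sketch: `fasta_record`, the `host=` tag and the 60-column width are illustrative choices, not a UniProt or NCBI convention.

```python
def fasta_record(identifier, sequence, host=None, width=60):
    """Format one FASTA record, optionally tagging the header with a host name."""
    header = ">" + identifier
    if host:
        header += " host=" + host
    # Cut the sequence into fixed-width lines.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


# Placeholder identifier and sequence, for illustration only.
print(fasta_record("EXAMPLE1", "MKV" * 30, host="plants"), end="")
```

To use it with the query above, select `?hostName` and `?realseq` as well as the protein IRI, and pass those values in.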
6.8 years ago

Data from the NCBI taxonomy browser will only get you so far: it will tell you that something is a "plant" virus, but not that it is a "tobacco" virus. To improve on this, you can download "all.asn.tar.gz" from the NCBI viral genomes FTP site: ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/

The asn files should contain a section with something like:

subtype nat-host,
subname "Haliotis rubra"

You can write a script to automatically extract this data. You can then translate the name of the host to the taxid of the host, and then you're in a golden situation and can do anything; get the complete lineage of the host, etc. I'd recommend looking into the ete toolkit's NCBI module if you are using Python.
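
A hedged sketch of that extraction step, using a regular expression over the text of one .asn file. It assumes the nat-host entry looks like the snippet above (a `subtype nat-host` line followed by a quoted `subname`); `extract_nat_hosts` is an illustrative name, and real files may need a more forgiving pattern.

```python
import re

# "subtype nat-host", a comma, then subname "<host>"; whitespace may vary.
NAT_HOST_RE = re.compile(r'subtype\s+nat-host\s*,\s*subname\s+"([^"]+)"')


def extract_nat_hosts(asn_text):
    """Return the natural-host names found in ASN.1 text, in order, without duplicates."""
    seen = []
    for host in NAT_HOST_RE.findall(asn_text):
        if host not in seen:
            seen.append(host)
    return seen


sample = 'subtype nat-host,\nsubname "Haliotis rubra"'
print(extract_nat_hosts(sample))
```

For the name-to-taxid step, ETE 3's `NCBITaxa.get_name_translator` is one option, assuming its local copy of the NCBI taxonomy database has been downloaded.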