Question: Bio.Entrez - Get virus host name using NCBI Taxonomy db
4.7 years ago by
United States
beegrackle90 wrote:

I'm trying to create a FASTA file with all the viral sequences for a particular gene, with taxonomy information in the record description. So far so good, except for one thing: I can see general host information on the taxonomy page of each virus (for example, this virus has "Host: plants" as part of its entry), but that information is not in the record I get back when I run an efetch query against the taxonomy database with the taxonomy ID. And I really want that host information! It's right there, taunting me. If anyone knows how to get at it, I'd really appreciate it.

Here is my query, in case it matters:

handle2 = Entrez.efetch(db="Taxonomy", id=taxid, retmode="xml")
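
For context, here is a minimal sketch of how such a query might be parsed with Biopython and turned into a FASTA description. The helper names `fetch_taxonomy` and `describe`, the `email` parameter, and the example record are hypothetical. The keys `TaxId`, `ScientificName`, and `Lineage` are assumed to be present in the parsed record. `fetch_taxonomy` needs Biopython and network access, so only `describe` is exercised here.

```python
def fetch_taxonomy(taxid, email):
    """Fetch and parse one Taxonomy record (needs Biopython and network access)."""
    from Bio import Entrez  # imported lazily so describe() works without Biopython

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.efetch(db="Taxonomy", id=str(taxid), retmode="xml")
    try:
        records = Entrez.read(handle)
    finally:
        handle.close()
    return records[0]


def describe(record):
    """Build a FASTA description from fields the parsed record is assumed to carry."""
    parts = [
        "taxid=" + str(record.get("TaxId", "NA")),
        "name=" + str(record.get("ScientificName", "NA")),
        "lineage=" + str(record.get("Lineage", "NA")),
    ]
    return " | ".join(parts)


# Hypothetical record, shaped like one element of the parsed efetch result.
example = {
    "TaxId": "12345",
    "ScientificName": "Example virus",
    "Lineage": "Viruses; Exampleviridae",
}
print(describe(example))
```

Nothing in this sketch yields a host, which is the gap the question is about.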


Based on what Neilfws wrote, I wrote some Python to scrape the NCBI taxonomy browser for the virus host name, for Ruby is Greek to me. Here it is for any other poor saps who need to do this. Depending on the tax UID (and, one presumes, how frisky a PI was feeling when they entered their sequence), the taxonomy browser sometimes takes you to a list of species links rather than the taxonomy entry, so this code accounts for that... usually.

from urllib.request import urlopen  # Python 3; the original used urllib2
import re

from bs4 import BeautifulSoup as BS

# Text between "Host: </em>" and the start of the next tag.
HOST_RE = re.compile(r'Host:\s</em>(.*?)<')


def print_hosts(soup):
    """Print every host found in the page's table cells; return the match count."""
    count = 0
    for cell in soup.body.form.find_all('td'):
        for match in HOST_RE.findall(str(cell)):
            print(match)
            count += 1
    return count


for tax_id in listoftaxids:  # listoftaxids: taxonomy IDs as strings
    address = ''+tax_id  # taxonomy browser URL for this ID
    soup = BS(urlopen(address), 'html.parser')
    found = print_hosts(soup)
    if found == 0:
        # No host on this page: it may be a list of species links, so follow each one.
        for link in soup.body.form.find_all('a', attrs={'title' : 'species'}):
            newaddress = ''+link.get('href')
            found += print_hosts(BS(urlopen(newaddress), 'html.parser'))
    if found == 0:
        print('SERIOUSLY???')
virus biopython entrez ncbi • 3.6k views
modified 4.4 years ago by howcouldyouforgetthisusername0 • written 4.7 years ago by beegrackle90

see also: Finding Main Virus Hosts From The Name Of The Virus

written 4.6 years ago by Pierre Lindenbaum125k
4.7 years ago by
Sydney, Australia
Neilfws48k wrote:

I'm pretty sure that Host is not returned in the XML of an Entrez query. You can get the same XML that efetch returns by visiting a URL like this one and selecting Send to -> File -> format -> XML, and that does not contain the host.

So all I can suggest is scraping the web page. Which is prone to failure of course, should the HTML change. Currently, there is a single table cell in which information, including the Host, is separated by line breaks. This does not make for easy parsing using e.g. XPath.

I came up with this (rough and ready, no error checks or tests) using Nokogiri for Ruby; I'm sure there's something similar in Python.


require 'nokogiri'
require 'open-uri'

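def get_host(uid)
    url   = "" + uid.to_s
    # URI.open is the open-uri entry point in current Ruby versions
    doc   = Nokogiri::HTML.parse(URI.open(url).read)
    data  = doc.xpath("//td").collect { |x| x.inner_html.split("<br>") }.flatten
    data.each do |e|
        puts $1 if e =~ /Host:\s+<\/em>(.*?)$/
    end
end

# the taxonomy UID is the first command-line argument
get_host(ARGV[0])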


Save that as e.g. taxhost.rb, then supply the taxonomy UID as first argument to the script.

$ ruby taxhost.rb 12249
$ ruby taxhost.rb 12721
$ ruby taxhost.rb 11709
vertebrates| human
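
For Python users, here is a minimal sketch of the same split-and-match idea, run on a string rather than a live page. The function name `extract_hosts` and the sample cell content are hypothetical, and the real page markup may differ. The pattern is modeled on the Ruby regex above, not copied from it.

```python
import re

# Capture whatever follows "Host:" and the closing </em>, up to the end of the piece.
HOST_RE = re.compile(r"Host:\s*</em>(.*?)\s*$")


def extract_hosts(cell_html):
    """Return the host strings found in one table cell's inner HTML."""
    hosts = []
    # Entries inside the cell are separated by line breaks.
    for piece in re.split(r"<br\s*/?>", cell_html):
        match = HOST_RE.search(piece.strip())
        if match:
            hosts.append(match.group(1).strip())
    return hosts


# Hypothetical cell content, for illustration only.
sample = (
    "<em>Rank: </em>species<br>"
    "<em>Host: </em>vertebrates| human<br>"
    "<em>Genetic code: </em>1"
)
print(extract_hosts(sample))  # prints ['vertebrates| human']
```

Working on the cell's inner HTML keeps the parsing step testable without hitting NCBI. Fetching the page and selecting the `td` elements would still be needed in front of it.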


modified 4.7 years ago • written 4.7 years ago by Neilfws48k

Thanks Neilfws, especially for the regex bit :)

written 4.6 years ago by beegrackle90

Pierre suggested I run my code over all virus taxonomy UIDs and create a flat-file "database" so here it is (for 134 378 UIDs as of last week).

written 4.6 years ago by Neilfws48k
4.5 years ago by
me690 wrote:

You can get this information from the UniProt SPARQL endpoint.

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
SELECT ?taxon ?name ?host ?hostName
WHERE {
    ?taxon a up:Taxon .
    ?taxon up:scientificName ?name .
    ?taxon up:host ?host .
    ?host up:scientificName ?hostName .
}

Most of the original question can then be answered like this, assuming you are interested in genes named "POL":


PREFIX rdf:<>
PREFIX skos:<>
PREFIX up:<>
PREFIX taxon:<>
PREFIX rdfs:<>
SELECT ?taxon ?fasta
WHERE {
    ?taxon a up:Taxon .
    ?taxon up:scientificName ?name .
    ?taxon up:host ?host .
    ?host up:scientificName ?hostName .
    ?protein up:organism ?taxon .
    ?protein up:encodedBy/skos:prefLabel "POL" .
    ?protein up:sequence ?sequence .
    ?sequence rdf:value ?realseq .
    BIND(CONCAT(">",SUBSTR(STR(?protein), 33),"\n",?realseq,"\n") AS ?fasta)
}
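
To make the BIND expression concrete, here is a small Python sketch of what it computes. It assumes protein URIs start with the 32-character prefix shown, which is why the query takes the substring from position 33. The function name `fasta_record`, the accession, and the sequence are hypothetical.

```python
# Assumption: entry URIs are this prefix followed by the accession.
UNIPROT_PREFIX = "http://purl.uniprot.org/uniprot/"


def fasta_record(protein_uri, sequence):
    """Mimic the BIND: '>' + accession + newline + sequence + newline."""
    # SPARQL SUBSTR is 1-based, so SUBSTR(s, 33) equals the Python slice s[32:].
    accession = protein_uri[32:]
    return ">" + accession + "\n" + sequence + "\n"


# The slice offset only works if the prefix really is 32 characters long.
assert len(UNIPROT_PREFIX) == 32

# Hypothetical accession and sequence, for illustration only.
print(fasta_record(UNIPROT_PREFIX + "P00000", "MKV"), end="")
```

If the URI scheme ever differs from the assumed prefix, the fixed offset would cut the identifier in the wrong place, so splitting on the last "/" may be a safer choice outside SPARQL.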
modified 4.5 years ago • written 4.5 years ago by me690
4.4 years ago by
United States

Data from the NCBI taxonomy browser will only get you so far: it tells you that something is a "plant" virus, but not that it is a "tobacco" virus. To improve on this, you can download "all.asn.tar.gz" from the NCBI viral genome FTP site:

The asn files should contain a section with something like:

subtype nat-host,
subname "Haliotis rubra"

You can write a script to automatically extract this data. You can then translate the name of the host to the taxid of the host, and then you're in a golden situation and can do anything; get the complete lineage of the host, etc. I'd recommend looking into the ete toolkit's NCBI module if you are using Python.
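
A minimal sketch of such an extraction script, assuming the host always appears as a `subtype nat-host` entry followed by a quoted `subname`, as in the snippet above. The function name `extract_nat_hosts` and the sample text are hypothetical, and real ASN files hold far more than this.

```python
import re

# Matches: subtype nat-host , subname "..."  (whitespace and newlines are flexible)
NAT_HOST_RE = re.compile(r'subtype\s+nat-host\s*,\s*subname\s+"([^"]*)"')


def extract_nat_hosts(asn_text):
    """Return every nat-host name in the text, with wrapped whitespace collapsed."""
    return [" ".join(name.split()) for name in NAT_HOST_RE.findall(asn_text)]


# Hypothetical excerpt modeled on the snippet above.
sample = '''
orgmod {
  { subtype strain, subname "Example strain" },
  { subtype nat-host,
    subname "Haliotis rubra" } }
'''
print(extract_nat_hosts(sample))  # prints ['Haliotis rubra']
```

For the name-to-taxid step, the ETE toolkit's `NCBITaxa` class offers `get_name_translator` and `get_lineage`. These need a local copy of the taxonomy database, so they are left out of the sketch.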


written 4.4 years ago by howcouldyouforgetthisusername0