Python assistance -ete3
2.0 years ago
Gino • 0

Hey guys, I'm new to Python and to bioinformatics in general.

I'm currently working on a project that requires me to translate information from two Excel files (each with a column for species/common name) into taxonomy IDs. Since the original species/common names are not always accurate, I found a function online that finds the best-matching correct species name. There is also a function that translates a species name into a taxonomy ID. Both functions are part of ETE3.

I don't know which values/variables should go into the function calls (shown near the end of this post) to get a result.
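For illustration, the first step of the workflow described above (gathering the names from both spreadsheets into one clean list) could be sketched roughly as follows. The helper names `load_species_column` and `collect_names`, and any file or column names passed to them, are hypothetical. The pandas import is kept inside the loader so that the rest of the sketch runs without it.

```python
def load_species_column(path, column):
    """Read one column of names from an Excel file (path and column are hypothetical)."""
    import pandas as pd  # imported lazily; reading .xlsx files also needs openpyxl
    return pd.read_excel(path)[column].dropna().astype(str).tolist()


def collect_names(*columns):
    """Merge several lists of names into one de-duplicated, whitespace-trimmed list."""
    seen = set()
    unique = []
    for column in columns:
        for raw in column:
            name = " ".join(str(raw).split())  # trim and collapse inner whitespace
            if name and name.lower() not in seen:
                seen.add(name.lower())
                unique.append(name)
    return unique


# Plain lists stand in for the two spreadsheet columns here.
names = collect_names(["Homo sapiens", " Mus  musculus "], ["homo sapiens", "Canis lupus"])
print(names)
```

The resulting list is the kind of value that a name-to-taxid lookup would then take as input.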

My current code in Python (Visual Studio Code, after activating Anaconda) is:

import pandas as pd
import numpy as np
import ete3
pip install ncbi-taxonomist

which prints: "Note: you may need to restart the kernel to use updated packages."

from ete3 import NCBITaxa
ncbi = NCBITaxa()
ncbi.update_taxonomy_database()

def get_fuzzy_name_translation(self, name, sim=0.9):
    '''
    Given an inexact species name, returns the best match in the NCBI database of taxa names.
    :argument 0.9 sim: Min word similarity to report a match (from 0 to 1).
    :return: taxid, species-name-match, match-score
    '''


    import sqlite3.dbapi2 as dbapi2
    _db = dbapi2.connect(self.dbfile)
    _db.enable_load_extension(True)
    module_path = os.path.split(os.path.realpath(__file__))[0]
    _db.execute("select load_extension('%s')" % os.path.join(module_path,
                                                             "SQLite-Levenshtein/levenshtein.sqlext"))


    print("Trying fuzzy search for %s" % name)
    maxdiffs = math.ceil(len(name) * (1-sim))
    cmd = 'SELECT taxid, spname, LEVENSHTEIN(spname, "%s") AS sim  FROM species WHERE sim<=%s ORDER BY sim LIMIT 1;' % (name, maxdiffs)
    taxid, spname, score = None, None, len(name)
    result = _db.execute(cmd)
    try:
        taxid, spname, score = result.fetchone()
    except TypeError:
        cmd = 'SELECT taxid, spname, LEVENSHTEIN(spname, "%s") AS sim  FROM synonym WHERE sim<=%s ORDER BY sim LIMIT 1;' % (name, maxdiffs)
        result = _db.execute(cmd)
        try:
            taxid, spname, score = result.fetchone()
        except:
            pass
        else:
            taxid = int(taxid)
    else:
        taxid = int(taxid)

    norm_score = 1 - (float(score)/len(name))
    if taxid:
        print("FOUND!    %s taxid:%s score:%s (%s)" %(spname, taxid, score, norm_score))

    return taxid, spname, norm_score
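To make the arithmetic in the function above concrete, here is a small pure-Python sketch of the two formulas it uses: the maximum number of edits allowed for a given `sim`, and the normalized score that is returned. The `levenshtein` helper below is a textbook implementation written for this example; it is only assumed to agree with what the SQLite `LEVENSHTEIN` extension computes.

```python
import math


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def max_edits(name, sim=0.9):
    """Same threshold as the quoted function: ceil(len(name) * (1 - sim))."""
    return math.ceil(len(name) * (1 - sim))


def normalized_score(name, distance):
    """Same normalization as the quoted function: 1 - distance / len(name)."""
    return 1 - (float(distance) / len(name))


query = "Homo sapien"        # misspelled input, 11 characters
candidate = "Homo sapiens"   # the name we hope to recover
distance = levenshtein(candidate, query)
print(max_edits(query), distance, round(normalized_score(query, distance), 3))
# prints: 2 1 0.909
```

So with `sim=0.9` an 11-character name may differ from a database name by at most 2 edits. When nothing matches, the quoted function leaves `score` at `len(name)`, which makes the returned score 0.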

The second function is:

def get_name_translator(self, names):
    """
    Given a list of taxid scientific names, returns a dictionary translating them into their corresponding taxids.
    Exact name match is required for translation.
    """

    name2id = {}
    #name2realname = {}
    name2origname = {}
    for n in names:
        name2origname[n.lower()] = n

    names = set(name2origname.keys())

    query = ','.join(['"%s"' %n for n in six.iterkeys(name2origname)])
    cmd = 'select spname, taxid from species where spname IN (%s)' %query
    result = self.db.execute('select spname, taxid from species where spname IN (%s)' %query)
    for sp, taxid in result.fetchall():
        oname = name2origname[sp.lower()]
        name2id.setdefault(oname, []).append(taxid)
        #name2realname[oname] = sp
    missing =  names - set([n.lower() for n in name2id.keys()])
    if missing:
        query = ','.join(['"%s"' %n for n in missing])
        result = self.db.execute('select spname, taxid from synonym where spname IN (%s)' %query)
        for sp, taxid in result.fetchall():
            oname = name2origname[sp.lower()]
            name2id.setdefault(oname, []).append(taxid)
            #name2realname[oname] = sp
    return name2id
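The lookup logic of this function can be tried out in isolation with a miniature, in-memory stand-in for the `species` table. This is only a sketch: the two-column table layout and the `COLLATE NOCASE` declaration are assumptions made for the example, the `synonym` fallback is left out, and the query uses `?` placeholders instead of the string interpolation in the quoted code. The three taxids are the real NCBI ones for those species.

```python
import sqlite3

# In-memory stand-in for the `species` table (assumed layout, three rows only).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE species (taxid INT PRIMARY KEY, spname VARCHAR(50) COLLATE NOCASE)")
db.executemany(
    "INSERT INTO species VALUES (?, ?)",
    [(9606, "Homo sapiens"), (10090, "Mus musculus"), (9615, "Canis lupus familiaris")],
)


def mini_name_translator(db, names):
    """Exact, case-insensitive name -> [taxid] lookup, keyed by the caller's own spelling."""
    name2origname = {n.lower(): n for n in names}
    if not name2origname:
        return {}
    placeholders = ",".join("?" for _ in name2origname)
    rows = db.execute(
        "SELECT spname, taxid FROM species WHERE spname IN (%s)" % placeholders,
        list(name2origname),
    )
    name2id = {}
    for spname, taxid in rows.fetchall():
        name2id.setdefault(name2origname[spname.lower()], []).append(taxid)
    return name2id


result = mini_name_translator(db, ["homo sapiens", "Mus musculus", "Homo sapien"])
print(result)
```

The misspelled "Homo sapien" is simply absent from the result, which is what "exact name match is required" means in practice: names that need correcting have to go through a fuzzy step first.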

All of this code runs fine. My problem is figuring out which values/variables to put in place of the ?'s so that an inaccurate species name is turned into an accurate one, using:

    from ete3 import NCBITaxa
    ncbi = NCBITaxa()
    fuzzy_name = ncbi.get_fuzzy_name_translation(?,?,?)
    print (dog?,0.9?)
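Going by the signature quoted earlier, `get_fuzzy_name_translation` is a method, so it would be called on an `NCBITaxa` instance with the inexact name and, optionally, `sim`; `self` is not passed by hand. A hedged sketch of such a call is below, together with a stand-in matcher based on the standard library's `difflib`. The stand-in is not ete3's algorithm (it uses a similarity ratio rather than Levenshtein distance) and needs a list of known names supplied by the caller; it is included because the ete3 method relies on an SQLite extension that may not load on every installation. The function names and the `known` list are made up for the example.

```python
import difflib


def fuzzy_with_ete3(ncbi, name, sim=0.9):
    """Call the ete3 method on an NCBITaxa instance; returns (taxid, matched name, score)."""
    taxid, matched_name, score = ncbi.get_fuzzy_name_translation(name, sim)
    return taxid, matched_name, score


def fuzzy_with_difflib(name, known_names, cutoff=0.9):
    """Stand-in matcher (difflib ratio, not Levenshtein): best known name, or None."""
    lowered = {k.lower(): k for k in known_names}
    hits = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


known = ["Homo sapiens", "Mus musculus", "Canis lupus familiaris"]
print(fuzzy_with_difflib("Homo sapien", known))  # close enough to "Homo sapiens"
print(fuzzy_with_difflib("dog", known))          # no close match in this list

# With ete3 installed and its taxonomy database built, the first helper
# could be used like this (not executed here):
#   from ete3 import NCBITaxa
#   ncbi = NCBITaxa()
#   print(fuzzy_with_ete3(ncbi, "Homo sapien", sim=0.9))
```

Either way, the three return values of the ete3 method are the taxid, the matched species name, and the match score, so the result is usually unpacked into three variables rather than printed with the inputs.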

I'd also like to know how to get taxonomy IDs using:

    from ete3 import NCBITaxa
    ncbi = NCBITaxa()
    taxid_name = ncbi.get_name_translator(?)
    print (?)
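Per its docstring, `get_name_translator` takes a list of names and returns a dictionary mapping each matched name to a list of taxids, so the single argument would be something like `["Homo sapiens", "Mus musculus"]`. The sketch below shows one way to flatten that result into one taxid per input name. `first_taxids` is a hypothetical helper, and `fake_translator` is a stand-in that mimics the documented return shape so the example runs without ete3.

```python
def first_taxids(names, translator):
    """Map each name to its first taxid, or None when there is no exact match."""
    name2ids = translator(names)
    return {name: (name2ids.get(name) or [None])[0] for name in names}


def fake_translator(names):
    """Stand-in for ncbi.get_name_translator, returning the same dict-of-lists shape."""
    known = {"homo sapiens": 9606, "mus musculus": 10090}
    return {n: [known[n.lower()]] for n in names if n.lower() in known}


print(first_taxids(["Homo sapiens", "Mus musculus", "Homo sapien"], fake_translator))

# With ete3 installed, the real bound method could be passed instead (not executed here):
#   from ete3 import NCBITaxa
#   ncbi = NCBITaxa()
#   print(first_taxids(["Homo sapiens", "Mus musculus"], ncbi.get_name_translator))
```

Names that come back as `None` are the ones that would need the fuzzy correction step before being looked up again.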

I ran

help(get_fuzzy_name_translation)
help(get_name_translator)

and got

Help on function get_fuzzy_name_translation in module __main__: 

get_fuzzy_name_translation(self, name, sim=0.9)
Given an inexact species name, returns the best match in the NCBI database of taxa names.
 :argument 0.9 sim: Min word similarity to report a match (from 0 to 1). 
:return: taxid, species-name-match, match-score

Help on function get_name_translator in module __main__:
get_name_translator(self, names)
 Given a list of taxid scientific names, returns a dictionary translating them into their corresponding taxids.
 Exact name match is required for translation.
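The `self` parameter in both signatures suggests these functions were written as methods of a class, and "in module __main__" suggests the help text describes the pasted copies rather than the ones inside ete3. A minimal sketch of how `self` behaves, using a hypothetical `Lookup` class in place of the real one:

```python
class Lookup:
    """Hypothetical stand-in for a class such as NCBITaxa."""

    def __init__(self):
        self.table = {"homo sapiens": 9606}

    def translate(self, name, sim=0.9):
        # `self` is the instance; callers never pass it explicitly.
        return self.table.get(name.lower()), sim


lookup = Lookup()                                    # parentheses create an instance
bound = lookup.translate("Homo sapiens")             # self is filled in automatically
explicit = Lookup.translate(lookup, "Homo sapiens")  # equivalent, self passed by hand
print(bound, explicit)

try:
    Lookup.translate("Homo sapiens")  # called on the class: the string is taken as `self`
except TypeError as exc:
    print("TypeError:", exc)
```

So for a method declared as `(self, name, sim=0.9)`, a call on an instance only needs the name and, optionally, the similarity threshold.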

I apologize for the long post and the poor code formatting; I tried my best to present the information as clearly as possible.

Any pointers would be great! I'm working on it every day to try to figure it out.

etetoolkit • 776 views