Question: DAVID not accepting locus_tag identifiers. Alternatives?
ThePresident wrote, 14 months ago:

Related to my previous post, I have a list of genes that have been pooled from around 60 different bacterial species based on some criteria. Now I would like to see whether there is any functional enrichment in this group. My idea was to use DAVID; however, none of the locus_tag inputs are recognized. When I manually look some of them up in NCBI's Gene database, they are found but often marked as "discontinued" or "updated".

Any idea how to work around this problem? locus_tag is the only gene identifier common to all these different bacterial species. Is there another tool for functional clustering that will accept locus_tags, or is there a way to translate those locus_tags into something DAVID can read?

Also, I was thinking of extracting the amino acid sequences of the hits and using them as queries to get protein domain signatures. Is there a tool that performs functional clustering based on protein signatures and domains?

Thank you, TP

locus_tag david • 699 views

Can you share some example locus tags so that I can see what I can do with them?

written 14 months ago by Kevin Blighe

Certainly! Here are some representative examples:


Thank you for taking a look at this.

written 14 months ago by ThePresident

I was able to recognise some of these using DAVID, but only by using the previous version (DAVID 6.7):

I also tried searching for them in Entrez using the following Python script:

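import sys
import argparse

from Bio import Entrez, SeqIO
from xml.dom import minidom

parser = argparse.ArgumentParser(description='Specify a field of NCBI locus tags, which get looked up in RefSeq for a corresponding gene name, which is appended to the line.')

parser.add_argument('-f', action='store', dest='locus_tag_col_str', required=True, help='The locus tag field in the tab-delimited input file.')
parser.add_argument('-e', action='store', dest='email_address', required=True, help='Entrez requires your email address.')
parser.add_argument('infile_name', help='Input file')

args = parser.parse_args()

# NCBI asks for a contact address with every Entrez request
Entrez.email = args.email_address

# Convert the 1-based field number from the command line to a 0-based index
locus_tag_col = int(args.locus_tag_col_str) - 1

with open(args.infile_name, 'r') as infile:
    for line in infile:
        # Skip blank lines, which would otherwise raise an IndexError below
        if not line.strip():
            continue

        # Placeholder until a gene name can be extracted from the record
        gene_name = '-'

        locus_tag = line.split()[locus_tag_col]

        search_term = 'refseq[FILTER] AND {}'.format(locus_tag)

        # Find the RefSeq nucleotide records that mention this locus_tag
        handle = Entrez.esearch(db='nucleotide', term=search_term)
        # Assumed: Entrez.read, the usual Biopython call for parsing the XML reply
        results = Entrez.read(handle)
        handle.close()

        for record_id in results['IdList']:
            print(record_id)

            # rettype='gb' with retmode='text' requests the GenBank flat file
            handle = Entrez.efetch(db='nucleotide', id=record_id, rettype='gb', retmode='text')
            # Assumed: keep the raw record text; extracting the gene name from it is still to do
            result = handle.read()
            handle.close()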

Execute this as python <script name> -f 1 -e <your email address> locustags.list (locustags.list just contains a single-column list of your locus_tags).

Using esearch, this script is capable of finding the species in which each locus_tag is found, but only in the Entrez nucleotide or nuccore databases, and I have been struggling to then extract the gene name for each using efetch. I have noted that these locus_tags appear to have mostly been discontinued.

Note however, that with the species ID you can then download a txt file of the species and possibly parse out the info of interest. Here's an example for CTLon_0753:

I'm sure that there's still a way to do this, but I have run out of time.
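For the parsing step mentioned above, here is a rough sketch of how the locus_tag could be mapped to other identifiers once the GenBank flat file for a species has been downloaded. It uses only the standard library and a simplified reading of the feature table layout (feature keys in column 6, qualifiers in column 22). The function name locus_tag_table is made up for illustration, and wrapped values whose continuation line starts with "/" are not handled, so treat it as a starting point rather than a robust parser.

```python
import re

FEATURE_START = re.compile(r"^ {5}\S")               # feature key starts in column 6
QUALIFIER = re.compile(r"^ {21}/(\w+)(?:=(.*))?$")   # /key=value starts in column 22


def locus_tag_table(genbank_text):
    """Map each locus_tag to other identifiers annotated on its features."""
    table = {}
    current = {}        # qualifier name -> list of values for the feature being read
    key = None          # qualifier whose value may continue on the next line
    in_features = False

    def flush():
        # Store the identifiers of the feature that has just ended
        for tag in current.get("locus_tag", []):
            entry = table.setdefault(tag, {})
            for name in ("gene", "product", "old_locus_tag", "protein_id"):
                if name in current:
                    entry.setdefault(name, current[name][0])
            for xref in current.get("db_xref", []):
                if xref.startswith("GeneID:"):
                    entry.setdefault("GeneID", xref.split(":", 1)[1])

    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line[0].isspace():    # ORIGIN, CONTIG or // ends the table
            flush()
            current, key, in_features = {}, None, False
            continue
        if FEATURE_START.match(line):         # a new gene/CDS/... feature begins
            flush()
            current, key = {}, None
            continue
        match = QUALIFIER.match(line)
        if match:
            key, value = match.group(1), match.group(2)
            if value is None:                 # flag qualifier such as /pseudo
                key = None
            else:
                current.setdefault(key, []).append(value.strip('"'))
        elif key is not None:                 # continuation of a wrapped value
            current[key][-1] += " " + line.strip().strip('"')
    flush()
    return table


# Usage (the file name is a placeholder):
# with open("genome.gbk") as handle:
#     table = locus_tag_table(handle.read())
# print(table.get("CTLon_0753"))
```

Whatever comes back for a tag (GeneID, old_locus_tag, protein_id) could then be tried as the DAVID input in place of the locus_tag itself. Tags whose records carry none of these will still need another route.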

written 14 months ago by Kevin Blighe

Thank you Kevin, I really appreciate the effort. One thing I didn't mention is that I know all the bacterial species associated with the locus_tags. The problem is that only a minority of the entries carry more useful identifiers (such as a GeneID or an "old locus_tag", which DAVID sometimes recognizes). The only consistent and unique feature common to all of them is the locus_tag, which unfortunately DAVID does not recognize.

My next step is to pull out the amino acid sequence for each of the loci (which I've done) and then use InterPro to get all associated conserved domains. I am hoping to find a tool that can do functional clustering based on these; otherwise I'll use the domains as a proxy to get some enriched functions and keep these limitations in mind.
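In case no ready-made tool turns up, the "proxy" route could look roughly like the sketch below: a plain hypergeometric over-representation test per domain, comparing the hit list against a background of all genes that went into the screen. This is only one possible approach, not something DAVID or InterPro provides. The function names and the input layout (a dict from locus_tag to a set of InterPro accessions) are assumptions for illustration, and the p-values are not corrected for multiple testing.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the domain."""
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, min(K, n) + 1))
    return favourable / total


def domain_enrichment(hits, background):
    """Rank domains by over-representation among the hits.

    background maps every gene identifier (e.g. a locus_tag) to the set of
    domain accessions found in its protein; hits is the subset of those
    identifiers that passed the selection criteria.
    """
    N, n = len(background), len(hits)
    # Only domains seen in at least one hit can be enriched
    domains = set().union(*(background[g] for g in hits))
    rows = []
    for domain in domains:
        K = sum(domain in found for found in background.values())
        k = sum(domain in background[g] for g in hits)
        rows.append((domain, k, K, hypergeom_upper_tail(k, n, K, N)))
    # Smallest p-value first
    return sorted(rows, key=lambda row: row[3])
```

Each returned row is (domain, count in hits, count in background, p-value). With genes pooled from around 60 species, the choice of background matters a lot, so the ranking is better read as a guide than as a formal result.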

PS - your script will come in handy for some other projects I have, so thank you!

written 14 months ago by ThePresident

Don't thank me! These guys here have some neat scripts for interrogating Entrez/RefSeq:

written 14 months ago by Kevin Blighe

It's easy to modify these for custom use though: A: Need help to retrive sequences

Good luck!

written 14 months ago by Kevin Blighe