Question: Retrieving Official Gene Symbols From Full Length Protein Names Automatically
5
7.6 years ago by
Eric Normandeau9.9k
Quebec, Canada
Eric Normandeau9.9k wrote:

Hi,

I have a list of a few hundred protein names and I would like to retrieve their official gene symbols automatically. For example:

  • Dihydroorotate dehydrogenase, mitochondrial precursor
  • ubiquitin-protein ligase
  • AP-2 complex subunit mu-1-A
  • Proliferation-associated protein 2G4

Becomes:

  • Dhodh
  • Rnf19a
  • Ap2m1
  • MRPL4

I am currently using UniProtKB manually, but I would very much like to automate it. Would anyone have suggestions about the following:

  • What database to use?
  • What approach/program/package to query it?
  • Online vs. downloading the database?
  • Any other means of doing this?

I don't mind having to write a parser for a database if needed, but I don't know what source to start with.

Thanks!

ADD COMMENTlink modified 7.6 years ago by Neilfws48k • written 7.6 years ago by Eric Normandeau9.9k

I think a better title for this question might be "Retrieving official gene/protein symbols from full length gene/protein names automatically"

ADD REPLYlink written 7.6 years ago by Casey Bergman17k

@Casey: Done :)

ADD REPLYlink written 7.6 years ago by Eric Normandeau9.9k

Thank you all for your comments and suggestions! Having the latest hot computer --> 2500$; One full run of 454 sequencing --> 6000$; Biostar Forum --> Priceless ;)

ADD REPLYlink written 7.6 years ago by Eric Normandeau9.9k
10
7.6 years ago by
Michael Kuhn4.9k
EMBL Heidelberg
Michael Kuhn4.9k wrote:

You can use the STRING API for this, like so:

echo "Dihydroorotate dehydrogenase, mitochondrial precursor" | \
xargs -I {} wget -nv -O - \
'http://stitch.embl.de/api/tsv-no-header/resolve?identifier={}&species=9606&echo_query=1' \
> protein_names.tsv

which gives you, along with the Ensembl ID, the gene name DHODH:

Dihydroorotate dehydrogenase, mitochondrial precursor   9606.ENSP00000219240    9606    Homo sapiens    DHODH   Dihydroorotate dehydrogenase, mitochondrial precursor (EC 1.3.3.1) (Dihydroorotate oxidase) (DHOdehase)

Plus, now you have valid STRING identifiers you can use to query the network. :-)
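
The tab-separated line returned above can be split into named fields before further processing. The following is a minimal sketch, assuming the six-column layout seen in the example output (query, STRING identifier, taxon ID, species name, gene name, annotation). The class name ResolveHit and its field names are hypothetical and not part of the STRING API.

```java
import java.util.Optional;

/** One row of the tab-separated "resolve" output (layout assumed from the example above). */
final class ResolveHit {
    final String query;
    final String stringId;
    final String taxonId;
    final String species;
    final String geneName;
    final String annotation;

    private ResolveHit(String[] f) {
        query = f[0];
        stringId = f[1];
        taxonId = f[2];
        species = f[3];
        geneName = f[4];
        annotation = f[5];
    }

    /** Parses one line; returns empty if it has fewer than six tab-separated columns. */
    static Optional<ResolveHit> parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length < 6) {
            return Optional.empty();
        }
        return Optional.of(new ResolveHit(fields));
    }
}
```

Calling ResolveHit.parse(line) on each line of protein_names.tsv and reading geneName from the result would give the symbol column directly; lines that do not fit the assumed layout come back empty rather than throwing.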

ADD COMMENTlink written 7.6 years ago by Michael Kuhn4.9k

Thanks for this method, Michael. I'll finish my boring manual annotation and then test it against the results obtained with the STRING API. I'll try to automate extraction of the results in the numerous cases where there are many hits, but that may be tricky. This may end up being 'computer-assisted manual annotation' :)

ADD REPLYlink written 7.6 years ago by Eric Normandeau9.9k

This is actually what I did a while ago when mapping a set of protein names (extracted from a collaborator's Excel table...): pipe all names into the API, and then edit the protein_names.tsv file to prune mismatches.
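
The pruning step can be partly automated by accepting names that resolve to a single gene name and setting the rest aside for manual review. A minimal sketch, assuming each result line starts with the echoed query and carries the gene name in the fifth tab-separated column, as in the example output earlier in the thread. The class HitTriage and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class HitTriage {
    /** Groups the distinct gene names (column 5) found for each query (column 1). */
    static Map<String, Set<String>> groupByQuery(List<String> lines) {
        Map<String, Set<String>> hits = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split("\t", -1);
            if (f.length < 5) {
                continue; // skip empty or malformed lines
            }
            hits.computeIfAbsent(f[0], k -> new LinkedHashSet<>()).add(f[4]);
        }
        return hits;
    }

    /** Queries with more than one distinct gene name are left for manual review. */
    static List<String> ambiguousQueries(Map<String, Set<String>> hits) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : hits.entrySet()) {
            if (e.getValue().size() > 1) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

Names that return no line at all never appear in the map, so they have to be detected by comparing the map keys against the original input list.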

ADD REPLYlink written 7.6 years ago by Michael Kuhn4.9k
3
7.6 years ago by
Larry_Parnell16k
Boston, MA USA
Larry_Parnell16k wrote:

Good question because this is a common task. In fact, it would be nice if someone could make available their table mapping protein name to gene/protein symbol because the task does not need to be repeated. I could definitely use this for human, mouse and rat.

You could try the HUGO / HGNC site for a list of the accepted or official names and symbols.

ADD COMMENTlink written 7.6 years ago by Larry_Parnell16k
1

The mapping tables that are used by STRING in the solution by Michael Kuhn are freely available from the STRING download page :-)
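
Once a mapping table has been downloaded, lookups can be done locally without calling a web service. A minimal sketch, assuming a hypothetical two-column, tab-separated layout (full name, then symbol). The actual STRING download files may use a different column order, in which case the column indices would need adjusting. The class NameLookup is an illustration only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class NameLookup {
    private final Map<String, String> symbolByName = new HashMap<>();

    /** Loads "full name<TAB>symbol" rows; this layout is an assumption, not a STRING format. */
    static NameLookup load(Reader source) {
        NameLookup lookup = new NameLookup();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t", -1);
                if (line.startsWith("#") || f.length < 2) {
                    continue; // skip comment lines and rows without two columns
                }
                // putIfAbsent keeps the first symbol seen when a name occurs twice
                lookup.symbolByName.putIfAbsent(key(f[0]), f[1].trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lookup;
    }

    /** Returns the symbol for a name, or null if the name is not in the table. */
    String symbolFor(String name) {
        return symbolByName.get(key(name));
    }

    /** Lookup key: trimmed and lower-cased so that case differences do not matter. */
    private static String key(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }
}
```

With a file on disk, NameLookup.load(new java.io.FileReader("mapping.tsv")) would build the table once (the file name is a placeholder), after which each of the few hundred names is a constant-time lookup.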

ADD REPLYlink written 7.6 years ago by Lars Juhl Jensen11k

Thank you, Lars. This is exactly what I meant - and not knowing of the resource creates an obstacle to my work moving forward. That's solved!

ADD REPLYlink written 7.6 years ago by Larry_Parnell16k
2
7.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum112k wrote:

The following Java program uses the NCBI E-utilities to query the Gene database.

If only one item is found, it prints the official gene symbol to stdout.

Else there is an ambiguity: it displays an interactive table and asks the user to select the correct row.

import java.awt.Dimension;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

import javax.swing.JOptionPane;
import javax.swing.JScrollPane;
import javax.swing.JTable;
import javax.swing.ListSelectionModel;
import javax.swing.table.DefaultTableModel;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class Biostar5460
    {
    private Logger LOG=Logger.getLogger("Biostar5460");
    private class Item
        {
        String id="";
        String Prot_ref_desc="";
        String Entrezgene_summary="";
        String locus;
        Item(String id)
            {
            this.id=id;
            }
        }

    private DocumentBuilder builder;
    private XPath xpath;
    private Biostar5460() throws Exception
        {
        DocumentBuilderFactory factory=DocumentBuilderFactory.newInstance();
        factory.setCoalescing(true);
        factory.setNamespaceAware(false);
        factory.setExpandEntityReferences(true);
        factory.setValidating(false);
        factory.setIgnoringComments(true);
        factory.setIgnoringElementContentWhitespace(true);
        builder=factory.newDocumentBuilder();

        this.xpath=XPathFactory.newInstance().newXPath();
        }
    private void search(String term) throws Exception
        {
        LOG.info(term);
        String uri="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&retmode=xml&tool=biostar5460" +
            "&email=me_at_nowhere_com&term="+
            URLEncoder.encode(term+" \"Homo sapiens\"[ORGN]","UTF-8");
        LOG.info(uri);
        Document dom=builder.parse(uri);
        NodeList idList=(NodeList)this.xpath.evaluate("/eSearchResult/IdList/Id", dom, XPathConstants.NODESET);
        if(idList.getLength()==0)
            {
            System.out.println("#NOT-FOUND\t"+term);
            return;
            }
        List<Item> array=new ArrayList<Item>(idList.getLength());
        for(int i=0;i< idList.getLength();++i)
            {
            LOG.info((i+1)+"/"+idList.getLength());
            Item item=new Item(idList.item(i).getTextContent());
            uri="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=xml&retmax=100&id="+item.id;
            LOG.info(uri);
            dom=builder.parse(uri);
            item.locus=(String)xpath.evaluate("/Entrezgene-Set/Entrezgene/Entrezgene_gene/Gene-ref/Gene-ref_locus", dom,XPathConstants.STRING);
            item.Prot_ref_desc=(String)xpath.evaluate("/Entrezgene-Set/Entrezgene/Entrezgene_prot/Prot-ref/Prot-ref_desc", dom,XPathConstants.STRING);
            item.Entrezgene_summary=(String)xpath.evaluate("/Entrezgene-Set/Entrezgene/Entrezgene_summary", dom,XPathConstants.STRING);
            array.add(item);
            }
        if(array.size()==1)
            {
            System.out.println(array.get(0).locus+"\t"+term);
            }
        else
            {
            DefaultTableModel m=new DefaultTableModel(new String[]{"id","locus","desc","summary"}, array.size());
            for(int i=0;i< array.size();++i)
                {
                Item item=array.get(i);
                m.setValueAt(item.id, i, 0);
                m.setValueAt(item.locus, i, 1);
                m.setValueAt(item.Prot_ref_desc, i, 2);
                m.setValueAt(item.Entrezgene_summary, i, 3);
                }
            JTable table=new JTable(m);
            table.setSelectionMode(ListSelectionModel.SINGLE_SELECTION);
            JScrollPane scroll=new JScrollPane(table);
            scroll.setPreferredSize(new Dimension(800,500));
            if(JOptionPane.showConfirmDialog(null, scroll,
                    "Select",
                    JOptionPane.OK_CANCEL_OPTION,JOptionPane.QUESTION_MESSAGE,null)
                    !=JOptionPane.OK_OPTION)
                {
                System.out.println("#NOT-FOUND\t"+term);
                return;
                }
            if(table.getSelectedRow()==-1)
                {
                System.out.println("#NOT-SELECTED\t"+term);
                return;
                }
            System.out.println(array.get(table.getSelectedRow()).locus+"\t"+term);
            }   
        }
    public static void main(String[] args)
        {
        try {
            Biostar5460 app=new Biostar5460();
            for(int i=0;i< args.length;++i)
                {
                app.search(args[i]);
                }
            } 
        catch (Exception e)
            {
            e.printStackTrace();
            }
        }
    }

compilation:

javac Biostar5460.java

execution:

java Biostar5460 "Dihydroorotate dehydrogenase, mitochondrial precursor" "ubiquitin-protein ligase" > output.tsv

ADD COMMENTlink written 7.6 years ago by Pierre Lindenbaum112k
2
7.6 years ago by
Casey Bergman17k
Athens, GA, USA
Casey Bergman17k wrote:

For some species, like D. melanogaster, there are look-up tables between full-length gene/protein name synonyms and their gene symbols that you could try to parse directly.

More generally, I think your problem is the same as the gene/protein name normalization (GNN) problem, which is currently a matter of active research in the text mining community. If so, then it appears there is no current solution to resolve full length gene/protein names to database identifiers and thence to official gene IDs, as in your case.

The state-of-the-art methods for the gene/protein name normalization problem are GNAT and GeneTUKit, but they may still not do as well as you would like. I also suspect that the same problems experienced by GNN will be experienced by the solutions proposed by Pierre and Michael (which I think are nevertheless both valid and worth trying). For example, running Michael's STRING approach yields the following promising, but not bullet-proof, results:

num_hits  official_name_found   full_gene_name 
1         Dhodh found           Dihydroorotate dehydrogenase, mitochondrial precursor 
193       Rnf19a not found *    ubiquitin-protein ligase 
3         Ap2m1 found           AP-2 complex subunit mu-1-A
1         MRPL4 not found       Proliferation-associated protein 2G4

* several other RNF family members found

Thus you may not get a single hit or the desired gene name with this approach or any other. Unfortunately, inherent variability in gene/protein name usage may be the enemy in the search for a fully automated solution to this problem.
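
A check like the one in the table above can be scripted. A minimal sketch, assuming the expected symbol and the list of gene names returned for a query are already in hand. Matching ignores case because the expected symbols are written in mixed case (Dhodh) while the human query returns upper-case names (DHODH). The class and method names are hypothetical.

```java
import java.util.List;

final class SymbolCheck {
    /** True if the expected symbol is among the returned names, ignoring case. */
    static boolean found(String expected, List<String> returnedNames) {
        for (String name : returnedNames) {
            if (name.equalsIgnoreCase(expected)) {
                return true;
            }
        }
        return false;
    }

    /** Formats one tab-separated report row: hit count, verdict, original full name. */
    static String reportRow(String fullName, String expected, List<String> returnedNames) {
        String verdict = found(expected, returnedNames) ? "found" : "not found";
        return returnedNames.size() + "\t" + expected + " " + verdict + "\t" + fullName;
    }
}
```

A row with a hit count above one still needs a human decision even when the expected symbol is among the hits, so the count is reported alongside the verdict rather than hidden.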

ADD COMMENTlink written 7.6 years ago by Casey Bergman17k
2
7.6 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

It's good to learn that there are resources like STRING which can help solve this common problem.

I would just make a general point: "names" are inherently ambiguous. Not just because there are many - even many synonyms for one object - but because of factors beyond your control: misspelling, erratic use of upper versus lower case and so on. This makes any kind of name-based search very difficult, which is why identifiers (accessions, official symbols) are preferred. It's often easier to query using IDs and retrieve names than the other way around.
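
Some of this variability, namely case and punctuation, can be reduced by normalizing names before any lookup; misspellings cannot be handled this way. A minimal sketch, in which the specific rules (lower-casing, turning punctuation into spaces, dropping a trailing "precursor") are illustrative assumptions rather than an established standard. The class NameNormalizer is hypothetical.

```java
import java.util.Locale;

final class NameNormalizer {
    /**
     * Lower-cases the name, replaces every run of characters other than
     * a-z and 0-9 with a single space, trims the result, and drops a
     * trailing "precursor" qualifier.
     */
    static String normalize(String name) {
        String s = name.toLowerCase(Locale.ROOT);
        s = s.replaceAll("[^a-z0-9]+", " ");     // hyphens, commas, repeated spaces
        s = s.trim();
        s = s.replaceAll("\\s+precursor$", "");  // assumed to be uninformative
        return s;
    }
}
```

Because only ASCII letters and digits are kept, names containing Greek letters or other non-ASCII characters would lose information; such names would need their own rule.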

ADD COMMENTlink written 7.6 years ago by Neilfws48k