Question

Given A Gene, Identify The World Experts

17

13.4 years ago

Andrew Su 4.9k

The goal of the Gene Wiki project is to create a collaboratively-written, community-reviewed, and continuously-updated review article for every gene in the human genome. So far, community contributions to these articles are going quite well based only on word-of-mouth and the Google factor.

However, we could potentially accelerate the effort if we actively recruited experts for each gene. So the question is -- given a gene, what is the best automated method to identify the world experts? Bonus points for any methods that also identify their email addresses. (The fallback there will be to use Amazon Mechanical Turk.)

NOTE: this question is to solicit ideas of how this would be done, but if anyone wants to work with us to actually implement this, we'd be happy to include you as an author on a future Gene Wiki paper.

EDIT: Since my comment to Pierre's answer is hidden, reposting it here. Pierre's solution is pretty awesome. The only thing that I can see right now to make it better is to put it on Google App Engine. Any GAE experts in the crowd want to comment on how hard that would be?

gene literature • 5.2k views

ADD COMMENT • link updated 7.2 years ago by Biostar 20 • written 13.4 years ago by Andrew Su 4.9k

Ram · Answer 1 · 2010-12-14

15

13.4 years ago

Pierre Lindenbaum 161k

Andrew, why on earth are you asking this when it's 00H01 here? It's time to sleep for me! :-) However, here is how I would do it:

1. For each gene, retrieve the XML definition of the gene using NCBI-ESearch and NCBI-EFetch.
2. Find all the publications associated with that entry by searching for the `<pubmedid>` tags.
3. Download each PubMed record as XML and extract the name of each author on the paper and their affiliation (it often contains the email).
4. Get the name of the most frequent author (this is the most difficult part, because names can be ambiguous).
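The counting part of the steps above can be sketched in Python. This is an illustration only, not the program mentioned in this answer. It assumes the standard PubMed XML layout (`PubmedArticle`, `Author`, `LastName`, `ForeName`, `Initials`) and the public EFetch endpoint. `efetch_url`, `count_authors` and `SAMPLE` are names introduced here, and the sample record is made up. It does nothing about ambiguous names, which remains the hard part.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, ids):
    """Build an EFetch URL that returns XML for the given record ids."""
    query = urlencode(
        {"db": db, "id": ",".join(str(i) for i in ids), "retmode": "xml"}
    )
    return f"{EUTILS}/efetch.fcgi?{query}"


def count_authors(pubmed_xml):
    """Count, per (last name, fore name), how many papers list that author."""
    counts = Counter()
    root = ET.fromstring(pubmed_xml)
    for article in root.iter("PubmedArticle"):
        seen = set()  # count each author once per paper
        for author in article.iter("Author"):
            last = author.findtext("LastName")
            if last is None:  # group authors carry no LastName
                continue
            fore = author.findtext("ForeName") or author.findtext("Initials") or ""
            seen.add((last, fore))
        counts.update(seen)
    return counts


# Made-up record set, only to show the expected input shape.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><Article><AuthorList>
    <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
    <Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><Article><AuthorList>
    <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>
"""

if __name__ == "__main__":
    print(efetch_url("pubmed", [1352724]))
    print(count_authors(SAMPLE).most_common(1))
```

The XML returned by the URL from `efetch_url("pubmed", pmids)` can be passed straight to `count_authors`; the top of `most_common()` is then the candidate list, still subject to the name-ambiguity caveat.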

Note: In 2007, I collected the names and the emails of some bioinformaticians by scanning pubmed with java. See my post.

UPDATE: OK, I've quickly written a program doing the job. It is available on GIST at :

Here is an excerpt from the output for 3 genes : ZC3H7B, EIF4G1 and PRNP.

<?xml version="1.0" encoding="UTF-8"?>
<experts>
  <gene name="ZC3H7B" geneId="23264" count-pmids="13">
    <Person>
      <firstName>Sumio</firstName>
      <lastName>Sugano</lastName>
      <pmid>8125298</pmid>
      <pmid>9373149</pmid>
      <pmid>14702039</pmid>
      <affilitation>International and Interdisciplinary Studies, The University of Tokyo, Japan.</affilitation>
      <affilitation>Institute of Medical Science, University of Tokyo, Japan.</affilitation>
      <affilitation>Helix Research Institute, 1532-3 Yana, Kisarazu, Chiba 292-0812, Japan.</affilitation>
    </Person>
  </gene>
  <gene name="eif4G1" geneId="1981" count-pmids="106">
    <Person>
      <firstName>Nahum</firstName>
      <lastName>Sonenberg</lastName>
      <pmid>7651417</pmid>
      <pmid>7935836</pmid>
      <pmid>8449919</pmid>
      (...)
      <affilitation>Department of Biochemistry and McGill Cancer Center, McGill University, Montreal, H3G 1Y6, Quebec, Canada.</affilitation>
      <affilitation>Department of Biochemistry, McGill University, Montreal, Quebec, Canada.</affilitation>
      <affilitation>Laboratories of Molecular Biophysics, The Rockefeller University, New York, New York 10021, USA.</affilitation>
      (...)
    </Person>
  </gene>
  <gene name="PRNP" geneId="5621" count-pmids="429">
    <Person>
      <firstName>John</firstName>
      <lastName>Collinge</lastName>
      <pmid>1352724</pmid>
      <pmid>1677164</pmid>
      <pmid>2159587</pmid>
      <pmid>20583301</pmid>
      (...)
      <mail>j.collinge@ic.ac.uk</mail>
      <affilitation>Krebs Institute for Biomolecular Research, Department of Molecular Biology and Biotechnology, University of Sheffield, Sheffield S10 2TN, UK.</affilitation>
      <affilitation>MRC Prion Unit and Department of Neurogenetics, Imperial College School of Medicine at St. Mary's, London, United Kingdom. J.Collinge@ic.ac.uk</affilitation>
      <affilitation>Division of Neuroscience (Neurophysiology), Medical School, University of Birmingham, Edgbaston, Birmingham, UK. sratte@pitt.edu</affilitation>
      (...)
    </Person>
  </gene>
</experts>

In the case of ZC3H7B, the result is wrong. Dr Sugano (3 articles) just used this gene as part of a larger set of genes. The expert would be D. Poncet, my former thesis advisor, but he has only 2 articles about this protein.

Eif4G1: I know that Dr Sonenberg is the expert. His email wasn't found.

PRNP: Collinge seems to be the expert. His e-mail was found too.

Updated code:

https://gist.github.com/lindenb/740496

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 13.4 years ago by Pierre Lindenbaum 161k

4

Actually, journal impact factor has nothing to do with the importance of individual articles. Common misconception :-) See http://altmetrics.org/manifesto/.

ADD REPLY • link 13.4 years ago by Neilfws 49k

1

@Larry, most (all?) genes in GeneWiki are human genes.

ADD REPLY • link 13.4 years ago by Pierre Lindenbaum 161k

0

A serious issue that emerges is gene synonyms. One may need to consider species as well because the same gene in different organisms will function differently. So, Pierre's step 2 needs some refinement but nonetheless gets my vote.

ADD REPLY • link 13.4 years ago by Larry_Parnell 16k

0

Pierre -- Science doesn't sleep, so neither should you... ;)

ADD REPLY • link 13.4 years ago by Andrew Su 4.9k

0

Nice Q/A. You could also take the journal impact factor into account. This way the authors would additionally be ranked by the "quality" of their work. Also, care should be taken that recipients do not mark the mail as spam.

ADD REPLY • link 13.4 years ago by Mns ▴ 20

0

As usual, I'm thoroughly impressed. I wonder how difficult it would be to adapt this to Google App Engine so we could call it as a web service. (On my first attempt to run it locally, I get an error that is likely due to my complete Java ignorance...) Any GAE experts in the audience?

ADD REPLY • link 13.4 years ago by Andrew Su 4.9k

0

I don't think that Google App Engine would be the best place to run this service: there is a lot of I/O and it could be slow (for example, for PRNP, 429 articles were downloaded).

ADD REPLY • link 13.4 years ago by Pierre Lindenbaum 161k

0

Bummer, was wondering if that would be an issue (as it is with my pubmed2wordle app). Anyway, thanks!

ADD REPLY • link 13.4 years ago by Andrew Su 4.9k

0

Amazing answer!!! Refining with Gene synonyms, Journal impact factor, Article views or downloads, Grants obtained by author (if any), works on similar genes, publication of invited reviews....

ADD REPLY • link 13.4 years ago by Rm 8.3k

score 5 · Answer 2 · 2010-12-13

5

13.4 years ago

Larry_Parnell 16k

Interesting question and project! We tried a similar approach back in 1998 when annotating the Arabidopsis genome. We were Christine Schueller, then at MIPS, several plant biologists who knew genes/gene families and wanted to get their hands on the sequence data prior to GenBank/EMBL deposits, and myself.

I feel that any method needs to consider more than one expert, even more than one identified person per paper, because the last author may not want to spend the time on this, or the paper may represent a collaborative effort with genuinely distributed expertise.

Who qualifies as an expert? If we are the only lab to publish on a given LOC1234.. gene, we may be the experts with very little data. Are we also experts on APOA5, publishers of 21 of the 314 papers on this gene? Perhaps, certainly in terms of genetic variants and the response to fat in the diet. Len Pennacchio was the first author on the first APOA5 paper, but this is no longer an interest of his nor of Ed Rubin, in whose lab he worked at the time. In other words, as the number of pubs grows, expertise becomes fragmented and specialized.

All this is not much of an answer but more in terms of guidance based on what I've seen. This is a hard question to answer satisfactorily.

ADD COMMENT • link 13.4 years ago by Larry_Parnell 16k

0

Nice answer, thanks Larry. I'll follow up to say that we indeed are looking to identify multiple experts. In fact, the beauty (and the curse) of the Gene Wiki is that anyone can edit these gene pages. So our goal is certainly not to anoint one person to be the authority, but simply to notify a few relevant people that the Gene Wiki page exists...

ADD REPLY • link 13.4 years ago by Andrew Su 4.9k

0

Larry, I guess Andrew is just looking for a few people/experts that would be qualified for editing an article about 'their' gene in wikipedia.

ADD REPLY • link 13.4 years ago by Pierre Lindenbaum 161k

0

That's all clear. My point is about the difficulty that arises as more knowledge about a gene is published. Identifying when an expert, or experts, for an emerging role of that gene is required can be difficult.

ADD REPLY • link 13.4 years ago by Larry_Parnell 16k

score 3 · Answer 3 · 2010-12-14

It might be interesting to take a two-phase approach - set up a system where people can nominate themselves as experts for a gene, and advertise the feature on forums, blogs, etc. Supplement this by automatically identifying candidate experts, and drawing their attention to the opt-in system via a mailshot.

I would agree with Pierre's approach for identifying candidates - search for papers on PubMed containing the name of the gene in the title/abstract. It might be useful to include known synonyms, e.g. from HGNC.
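A title/abstract search over a symbol and its synonyms could be built roughly as follows. This is a minimal sketch: `pubmed_term` is a name made up for illustration, `ALIAS1` is a placeholder rather than a real alias, and the synonym list is assumed to come from elsewhere (e.g. HGNC). `[Title/Abstract]` is the standard PubMed field tag.

```python
def pubmed_term(symbols):
    """OR together gene symbols, each restricted to the title/abstract field."""
    quoted = [f'"{symbol}"[Title/Abstract]' for symbol in symbols]
    return "(" + " OR ".join(quoted) + ")"


if __name__ == "__main__":
    # "ALIAS1" is a placeholder, not a real alias of PRNP.
    print(pubmed_term(["PRNP", "ALIAS1"]))
```

The resulting string would go into the `term` parameter of an ESearch request against the `pubmed` database.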

The PubMed XML contains a distinct element for each author, but afaik there is usually only one Affiliation element that would typically contain a single e-mail address. Complications include the fact that the element seems to be free-text with no standard format, and there is no definitive way to associate the e-mail address with a specific author. Even so, I would pluck out the address with the basic someone@somewhere.something pattern.
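Plucking addresses out of the free-text affiliation might look like this. It is a rough sketch with a deliberately simple pattern, not a full address validator; `EMAIL_RE` and `extract_emails` are names introduced here. Addresses are lower-cased so that variants such as `J.Collinge@ic.ac.uk` and `j.collinge@ic.ac.uk` consolidate.

```python
import re

# Simple someone@somewhere.something pattern; a trailing full stop that
# ends the affiliation sentence is not captured.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(affiliation):
    """Return all e-mail-like strings in an affiliation, lower-cased."""
    return [match.lower() for match in EMAIL_RE.findall(affiliation)]


if __name__ == "__main__":
    text = (
        "MRC Prion Unit and Department of Neurogenetics, Imperial College "
        "School of Medicine at St. Mary's, London, United Kingdom. "
        "J.Collinge@ic.ac.uk"
    )
    print(extract_emails(text))
```

As noted above, the extracted address still cannot be tied to a specific author with certainty.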

I would ignore the authors, since I assume you don't have time to manually find 10,000+ e-mail addresses, or even to verify them if they were automatically found. Instead, focus on the one e-mail address per paper, and consolidate them for the mailshot. For each gene, take only the 50 most common e-mail addresses and send the mail with "Dear Colleague". Your scripts should be able to customise each mail to list the gene (or genes) that the e-mail address is associated with.
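The consolidation step could be sketched as below, assuming one `(gene, email)` pair has already been extracted per paper. `top_emails` and `genes_by_email` are hypothetical helper names, and the records in the usage example are made up.

```python
from collections import Counter, defaultdict


def top_emails(records, limit=50):
    """Map each gene to its most common addresses.

    records: iterable of (gene, email) pairs, one per paper.
    """
    per_gene = defaultdict(Counter)
    for gene, email in records:
        per_gene[gene][email.lower()] += 1
    return {
        gene: [email for email, _ in counts.most_common(limit)]
        for gene, counts in per_gene.items()
    }


def genes_by_email(top):
    """Invert the mapping so each address gets one mail listing all its genes."""
    merged = defaultdict(list)
    for gene, emails in top.items():
        for email in emails:
            merged[email].append(gene)
    return dict(merged)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [
        ("GENE1", "a@example.org"),
        ("GENE1", "A@example.org"),
        ("GENE1", "b@example.org"),
        ("GENE2", "a@example.org"),
    ]
    print(genes_by_email(top_emails(records, limit=1)))
```

Each key of the inverted mapping is one "Dear Colleague" mail, and its value is the list of genes to mention in it.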

score 3 · Answer 4 · 2010-12-14

If you want to mine the literature to suggest candidates, you may want to distinguish between reviews and primary literature. Reviews, particularly at high-impact journals (e.g. Nature Reviews) are often invited by the editors, and the senior authors are typically either someone with a strong track record or someone who has recently published an interesting result.

The position of authors is also informative; someone who was senior or first author is more likely to be an expert on the overall science presented by the paper.
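One way to fold author position and the review/primary distinction into a ranking is a simple weighted count. This is a sketch under stated assumptions: the weights are arbitrary placeholders, not tuned values, and `score_authors` and its parameters are names introduced here. Each paper is given as a list of author names in byline order plus a flag saying whether it is a review.

```python
from collections import defaultdict


def score_authors(papers, edge_weight=2.0, middle_weight=1.0, review_bonus=1.0):
    """Rank authors by a position-weighted paper count.

    papers: iterable of (authors, is_review), authors in byline order.
    First and last authors get edge_weight, others middle_weight; the last
    author of a review gets review_bonus on top.
    """
    scores = defaultdict(float)
    for authors, is_review in papers:
        last_index = len(authors) - 1
        for i, name in enumerate(authors):
            weight = edge_weight if i in (0, last_index) else middle_weight
            if is_review and i == last_index:
                weight += review_bonus
            scores[name] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up bylines, for illustration only.
    papers = [(["A", "B", "C"], False), (["D", "C"], True)]
    print(score_authors(papers))
```

Whether a paper is a review would have to come from the record itself, for example its publication type.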

You may also need to consider how someone could be an expert in the role of a gene in fly development but may know little about its role in colon cancer. Saying someone is a "Myc expert" usually means in a particular context.

Those are obvious points, but they aren't clearly stated in the above notes.

If you spam thousands of scientists with an invitation to write for any website, no matter how virtuous, you are likely to invite a backlash. Getting some prominent proponents on board and getting them to spread the word organically may be a better approach.

score 2 · Answer 5 · 2010-12-14

2

13.4 years ago

Giovanni M Dall'Olio 28k

well... a possible solution is to look at WikiGenes, which automatically creates a template for each gene by automatic analysis of the literature.

For example, you can look at the entry for a specific gene and get the list of authors from there.

p.s. why don't you try to collaborate with WikiGenes and the other efforts? For users, it is disturbing to see so many resources spent and no communication between similar projects.

ADD COMMENT • link 13.4 years ago by Giovanni M Dall'Olio 28k

0

Using the text-mined and referenced content in WikiGenes is an interesting idea. Will definitely think about it. More generally, we have friendly communications with others in this sphere. But ultimately each group has a slightly different model, and we all believe that they are all worth trying out...

ADD REPLY • link 13.4 years ago by Andrew Su 4.9k