Question

Automated Literature Search For List Of Genes

10.9 years ago

Davy ▴ 410

I have a smallish list of genes that I need to do some literature searching on. There are about 80 of them, so searching individually for each one and all of its aliases would be far too time-consuming. Does anyone know of a tool I could use to search PubMed (or another database) with a list of genes, other than building a very long query by hand?

literature genetics • 6.0k views

ADD COMMENT • link updated 10.4 years ago by reachtoskumar ▴ 10 • written 10.9 years ago by Davy ▴ 410

score 8 · Answer 1 · 2013-06-11

I modified a script from EUtilities years ago, without using BioPerl.

Please check the code; you can modify it to fit your needs :-)

# 2010/11/29
# pubfetch.pl
# Code modified based on Entrez Programming Utilities from PubMed
# http://eutils.ncbi.nlm.nih.gov/
# Usage: perl pubfetch.pl
use 5.010;
use strict;
use warnings;
use LWP::Simple;                  # https URLs also need LWP::Protocol::https installed
use URI::Escape qw(uri_escape);

print "Please Enter The Keyword for Fetch: "; # ask for keyword to search
my $keyword = <STDIN>;
chomp $keyword;

# Per-year counts and the fetched abstracts go to two separate files
open my $count_fh,  '>', "$keyword.count.txt"  or die "Cannot open $keyword.count.txt: $!";
open my $result_fh, '>', "$keyword.result.txt" or die "Cannot open $keyword.result.txt: $!";

my $utils  = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";
my $db     = "Pubmed";
my $query  = uri_escape($keyword);   # escape spaces and other unsafe characters
my $report = "abstract";

for my $year (2000 .. 2011) {        # range of years to fetch
    my $esearch = "$utils/esearch.fcgi?" .
              "db=$db&retmax=1&usehistory=y&maxdate=$year&mindate=$year&term=";
#        say "$esearch$query";

    my $output = get($esearch . $query);
    if (!defined $output) {
        warn "esearch request failed for year $year\n";
        next;
    }
    my ($web)   = $output =~ /<WebEnv>(\S+)<\/WebEnv>/;
    my ($key)   = $output =~ /<QueryKey>(\d+)<\/QueryKey>/;
    my ($count) = $output =~ /<eSearchResult><Count>(\d+)<\/Count>/;
    $count //= 0;
#    say "$web $key $count";

    print "The total number of publications for $keyword in year $year is $count\n";
    print $count_fh "$year $count\n";

    if ($count != 0) {
        # Reuse the search stored on the history server (WebEnv + query_key)
        my $efetch = "$utils/efetch.fcgi?" .
               "rettype=$report&retmode=text&retstart=0&retmax=10000&" .
               "db=$db&query_key=$key&WebEnv=$web";

        my $efetch_result = get($efetch);
        print $result_fh $efetch_result if defined $efetch_result;
    }
}
close $count_fh;
close $result_fh;

# Extract the PMIDs from the fetched abstracts
my $idcount = 0;
open my $id_fh,       '<', "$keyword.result.txt" or die "Cannot open $keyword.result.txt: $!";
open my $idresult_fh, '>', "$keyword.IDlist.txt" or die "Cannot open $keyword.IDlist.txt: $!";

while (<$id_fh>) {
    if (m/^PMID:\s(\d+)/) {
        print $idresult_fh "$1\n";
        $idcount++;
    }
}
close $id_fh;
close $idresult_fh;

print "the total number of the publication fetched is $idcount\n";
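
For anyone who would rather not use Perl, the esearch step of the script can be sketched in Python with the standard library only. This is a rough sketch, not the original author's code: the function names and the sample XML are made up for illustration, and only the element names (Count, QueryKey, WebEnv, IdList/Id) and the efetch parameters are taken from the script above. Nothing is sent over the network here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL is an assumption: the script's eutils host, served over https.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def parse_esearch(xml_text):
    """Pull the fields the Perl script extracts with regexes out of esearch XML."""
    root = ET.fromstring(xml_text)
    return {
        "count": int(root.findtext("Count", default="0")),
        "query_key": root.findtext("QueryKey"),
        "web_env": root.findtext("WebEnv"),
        "ids": [node.text for node in root.findall("IdList/Id")],
    }


def efetch_history_url(result, retmax=10000):
    """Build the efetch URL that reuses a search stored with usehistory=y."""
    params = {
        "db": "pubmed",
        "query_key": result["query_key"],
        "WebEnv": result["web_env"],
        "rettype": "abstract",
        "retmode": "text",
        "retstart": 0,
        "retmax": retmax,
    }
    return EUTILS + "/efetch.fcgi?" + urlencode(params)


# Hypothetical, hand-written response used only to exercise the parser.
SAMPLE = (
    "<eSearchResult><Count>2</Count><QueryKey>1</QueryKey>"
    "<WebEnv>EXAMPLE_ENV</WebEnv>"
    "<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>"
)

result = parse_esearch(SAMPLE)
print(result["count"], result["ids"])
print(efetch_history_url(result))
```

The returned URL can then be fetched with whatever HTTP client you prefer, once per gene and per year as in the Perl loop.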

score 8 · Answer 2 · 2013-06-11

10.9 years ago

Pierre Lindenbaum 161k

a one-liner :-)

$ echo -e "NOTCH2\nPRKCB1" | while read G; do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=${G}" | xsltproc <(echo "<x:stylesheet xmlns:x='http://www.w3.org/1999/XSL/Transform' version='1.0'><x:output method='text'/><x:template match='/'>https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&amp;retmode=text&amp;rettype=abstract&amp;id=<x:for-each select='eSearchResult/IdList/Id'><x:value-of select='.'/>,</x:for-each>
</x:template></x:stylesheet>") - | xargs curl -s ; done

1. J Immunol. 2013 Jun 5. [Epub ahead of print]

Intrinsic Molecular Factors Cause Aberrant Expansion of the Splenic Marginal Zone
B Cell Population in Nonobese Diabetic Mice.

Stolp J, Mariño E, Batten M, Sierro F, Cox SL, Grey ST, Silveira PA.

Garvan Institute of Medical Research, Immunology Program, Darlinghurst, New South
Wales 2010, Australia.

I've grouped all the Ids into a single curl+efetch call. That won't work when esearch returns a large number of Ids (big retmax), but you could generate one URL per Id instead:

....  | xsltproc <(echo "<x:stylesheet xmlns:x='http://www.w3.org/1999/XSL/Transform' version='1.0'><x:output method='text'/><x:template match='/'><x:for-each select='eSearchResult/IdList/Id'>https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&amp;retmode=text&amp;rettype=abstract&amp;id=<x:value-of select='.'/>
</x:for-each></x:template></x:stylesheet>") -  | while read U; do curl -s "${U}" ; done

ADD REPLY • link 10.9 years ago by Pierre Lindenbaum 161k
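
A middle ground between one giant efetch call and one call per Id is to send the Ids in batches. Below is a rough Python sketch of that idea; the helper names and the batch size of 200 are assumptions chosen for illustration, not NCBI limits, and the code only builds URLs without sending any request.

```python
from urllib.parse import urlencode

# Same efetch endpoint and parameters as in the one-liner, over https.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def efetch_urls(pmids, batch_size=200):
    """Return one efetch URL per batch of PubMed Ids (batch_size is an assumption)."""
    urls = []
    for batch in chunked(list(pmids), batch_size):
        params = {
            "db": "pubmed",
            "retmode": "text",
            "rettype": "abstract",
            "id": ",".join(batch),
        }
        # safe="," keeps the comma-separated Id list readable in the URL
        urls.append(EFETCH + "?" + urlencode(params, safe=","))
    return urls


# Placeholder Ids, not real PMIDs.
for url in efetch_urls(["1", "2", "3"], batch_size=2):
    print(url)
```

Each printed URL can be passed to curl exactly like the URLs produced by the XSLT above.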

Can the literature-search code work for other identifiers as well, e.g. rs IDs or protein IDs?

ADD REPLY • link 10.9 years ago by NB ▴ 960

For rs IDs, use NCBI elink. For protein IDs, yes, but as with genes, beware of ambiguities.

ADD REPLY • link 10.9 years ago by Pierre Lindenbaum 161k
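
One possible way to cut down on some ambiguous hits, and to cover the aliases mentioned in the question, is to OR a gene's symbols together and restrict each one to the title and abstract with PubMed's [tiab] field tag. The sketch below is only an illustration: the function names are made up and the aliases are placeholders, so substitute the real aliases for your genes.

```python
from urllib.parse import quote_plus


def gene_term(symbol, aliases=(), field="tiab"):
    """OR a symbol with its aliases, each quoted and limited to one PubMed field."""
    names = [symbol] + [alias for alias in aliases if alias != symbol]
    return " OR ".join('"{}"[{}]'.format(name, field) for name in names)


def terms_for_genes(alias_table):
    """Map each gene symbol to its PubMed search term."""
    return {symbol: gene_term(symbol, aliases)
            for symbol, aliases in alias_table.items()}


# ALIAS_A and ALIAS_B are placeholders, not real gene aliases.
table = {"NOTCH2": ["ALIAS_A"], "PRKCB1": ["ALIAS_B"]}
for symbol, term in terms_for_genes(table).items():
    # quote_plus makes the term safe to append after "term=" in an esearch URL
    print(symbol, term, quote_plus(term))
```

Restricting to title and abstract will not remove every false positive (short symbols that are also common words remain a problem), so the results still need a manual check.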

As usual, Pierre to the rescue!!!

ADD REPLY • link 10.9 years ago by Davy ▴ 410

Is it possible to display more than 20 items? PubMed lets you see 200 items per page.

ADD REPLY • link 10.9 years ago by PoGibas 5.1k

yes, see the NCBI doc for esearch / retmax. See also my first comment.

ADD REPLY • link 10.9 years ago by Pierre Lindenbaum 161k
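
To make the retmax hint concrete: esearch returns 20 Ids by default, and retmax together with retstart lets you page through a larger result set. Here is a small Python sketch that only builds the paged URLs; the function name and the page size of 200 are assumptions for illustration, and the total would normally come from the Count element of a first esearch call.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_page_urls(term, total, page_size=200):
    """Return esearch URLs that together cover `total` hits, page_size at a time."""
    urls = []
    for retstart in range(0, total, page_size):
        params = {
            "db": "pubmed",
            "term": term,
            "retstart": retstart,   # index of the first Id on this page
            "retmax": page_size,    # number of Ids returned per call
        }
        urls.append(ESEARCH + "?" + urlencode(params))
    return urls


# 450 is a made-up hit count used to show the paging.
for url in esearch_page_urls("NOTCH2", total=450):
    print(url)
```

With a total of 450 this yields three URLs, starting at retstart 0, 200 and 400.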

score 6 · Answer 3 · 2013-06-11

option 1: the Publication track in UCSC

The UCSC browser recently added a new track, called Publications, containing literature related to a gene. You can therefore use the UCSC APIs to get all the references for a gene. For example, the following will get you all the references for the gene "CD97":

http://genome.ucsc.edu/cgi-bin/hgc?hgsid=338677393&c=chr19&o=14491955&t=14519537&g=pubsMarkerGene&i=CD97

I guess that you can also connect to the MySQL table, but I am not 100% sure that the "articleId" field corresponds to the PubMed ID:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select * from hgFixed.pubsMarkerAnnot where markerId="CD97" limit 10'

# select only the Ids (less verbose output)
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select distinct articleId, markerId from hgFixed.pubsMarkerAnnot where markerId="CD97" limit 10'
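
For a whole list of genes, the same query can be generated with an IN clause instead of being typed once per gene. This is a rough Python sketch under the assumption that the table, columns, host and flags are exactly those of the commands above; the function names and the symbol check are made up for illustration, and the code only builds the command without running it.

```python
import re

# Conservative whitelist for gene symbols, so nothing odd ends up inside the SQL.
SYMBOL_RE = re.compile(r"[A-Za-z0-9._-]+")


def pubs_query(genes):
    """Build the 'select distinct' query from the post for several markerIds."""
    if not genes:
        raise ValueError("need at least one gene symbol")
    for gene in genes:
        if not SYMBOL_RE.fullmatch(gene):
            raise ValueError("unexpected characters in gene symbol: %r" % gene)
    quoted = ",".join("'%s'" % gene for gene in genes)
    return ("select distinct articleId, markerId from hgFixed.pubsMarkerAnnot "
            "where markerId in (%s)" % quoted)


def mysql_command(genes):
    """Argument list for the mysql client, mirroring the command line above."""
    return ["mysql", "--user=genome", "--host=genome-mysql.cse.ucsc.edu",
            "-A", "-D", "hg19", "-e", pubs_query(genes)]


print(pubs_query(["CD97", "NOTCH2"]))
```

The list returned by mysql_command can be handed to subprocess.run if the mysql client is installed. The caveat about articleId possibly not being a PubMed ID applies here as well.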

option 2: getting citations from Uniprot

UniProt has some well-curated citations for genes. You can get all the references for a list of genes by using the "Retrieve" tool from the UniProt main page, and then parsing the RDF file.

option 3: use the eutils, but from another tool

If you do not want to spend time learning the BioPerl (or Biopython) APIs to eutils, you can try this Taverna workflow.

score 2 · Answer 4 · 2013-06-11

10.9 years ago

cts ★ 1.7k

BioPerl has this sort of functionality. I've never used it to query PubMed, but the following website contains snippets to help you on your way: http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#Simple_database_query

I think what you're looking for is the esearch or efetch utilities.

Thanks but I was hoping to avoid having to use the BioPerl EUtilities. :( I really should get around to familiarising myself with them but I abandoned Perl a long time ago.

ADD REPLY • link 10.9 years ago by Davy ▴ 410

Are you also averse to the other "bio" packages? I believe Biopython and BioRuby have similar functionality (although I've never used them).

ADD REPLY • link 10.9 years ago by cts ★ 1.7k

score 0 · Answer 5 · 2013-11-26

10.4 years ago

reachtoskumar ▴ 10

You can also try BioGyan (http://www.biogyan.com/). It is a comprehensive search tool designed specifically for biologists, enabling search, annotation and ranking of scientific literature from public databases. It accepts multiple genes.
