Question: Looking for a faster way to get DOIs of PubMed publications
7 months ago, n.anuragsharma wrote:

Is there a faster way to get DOIs from PubMed for an input list of PMIDs?

Alternatively, is there a way to do all of this in R/Bioconductor?

import metapub
from Bio import Entrez, Medline

# Function to get the number of publications matching a query term
def pubmed_publications_count(term):
    Entrez.email = "n.anuragsharma@gmail.com"  # My email ID
    handle = Entrez.esearch(db="pubmed", term=term, rettype='count')
    record = Entrez.read(handle)
    handle.close()
    count = record["Count"]
    return count

# Function to get all the PMIDs matching a query term
def pubmed_publications_ids(term):
    count = pubmed_publications_count(term)
    Entrez.email = "n.anuragsharma@gmail.com"  # My email ID
    handle = Entrez.esearch(db="pubmed", term=term,retmax = count)
    record = Entrez.read(handle)
    handle.close()
    idlist = record["IdList"]
    return idlist

When I'm working with a small number of PMIDs, the wait time seems all right:

term = "Arabidopsis hypotcotyl PIF dark root"
idlist = pubmed_publications_ids(term) # returns 3 PMIDs
doi = [metapub.FindIt(Id).doi for Id in idlist]

But with anything more than 10 or so PMIDs, as in this example, it takes forever to get what I need:

term = "Arabidopsis hypotcotyl PIF dark"
idlist = pubmed_publications_ids(term) # returns 44 PMIDs
doi = [metapub.FindIt(Id).doi for Id in idlist]

What I've tried so far, all of which throw an error:

# Passing the PMIDs as a comma-separated string, as indicated in "The E-utilities In-Depth: Parameters, Syntax and More":
term = "Arabidopsis hypotcotyl PIF dark root"
idlist = pubmed_publications_ids(term)
idlist = ",".join(idlist)
doi = metapub.FindIt(idlist)
---ValueError: invalid literal for int() with base 10: '28970478,25763615,24279300'

# Passing the PMIDs as a semicolon-separated string, as indicated in an SO comment by @Maximilian Peters:
term = "Arabidopsis hypotcotyl PIF dark root"
idlist = pubmed_publications_ids(term)
idlist = ";".join(idlist)
doi = metapub.FindIt(idlist)
---ValueError: invalid literal for int() with base 10: '28970478;25763615;24279300'

# I can't even get it to work with the input as a list:
term = "Arabidopsis hypotcotyl PIF dark root"
idlist = pubmed_publications_ids(term)
doi = metapub.FindIt(idlist)
---TypeError: int() argument must be a string, a bytes-like object or a number, not 'ListElement'

Is there any way to submit a whole list of PMIDs at once instead of writing a loop? Alternatively, any other package with clearer documentation would help. If I have missed any existing documentation for this problem, I apologise in advance!

Edit ---------------------------------------------------------------------------------------------------------------------------------------------------

From the maintainer of Metapub:

One thing is that you don't want to be using FindIt just for doi lookups -- FindIt is slow because it goes looking for the PDFs of papers as well!

You should try this function:

from metapub.text_mining import pmid2doi

... pmid2doi is (currently) the fastest way to accurately get a DOI for a PMID.

And as she said, my code does indeed run a lot faster. I haven't worked out how to time a function in Python from within an IDE, so I haven't added any timings here.

Also, there is a similar package for R, easyPubMed; however, I find that its options are rather limited at the moment.
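For anyone wanting to put numbers on the speed difference, here is a minimal timing sketch that works the same in an IDE or a plain script. It uses only `time.perf_counter` from the standard library; `time_call` and `slow_square` are hypothetical names invented for this illustration, and `slow_square` is just a stand-in for the real lookup.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def slow_square(x):
    """Stand-in workload (hypothetical); sleeps briefly to mimic a network call."""
    time.sleep(0.05)
    return x * x


result, elapsed = time_call(slow_square, 4)
print(f"slow_square(4) = {result}, took {elapsed:.3f} s")
```

To compare the two approaches from the question, the same helper could presumably wrap each list comprehension, for example `time_call(lambda: [metapub.FindIt(Id).doi for Id in idlist])`, and the two elapsed times can then be compared directly.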

pubmed medline metapub bio python • 437 views
modified 7 months ago • written 7 months ago by n.anuragsharma

NCBI publishes annual PubMed data dumps (and incremental updates) on their FTP site. You could download that data and parse it locally. That should be much faster, but the data may lag behind the most current records.

You could also look at using Entrez Direct (EDirect) instead of Python.

modified 7 months ago • written 7 months ago by GenoMax

I'll check them out, thanks.

written 7 months ago by n.anuragsharma

Aren't you performing several queries: first one query with "term", and then several more, one for each PubMed ID? Of course this will get very slow as the number of records grows. Can't you store the full result of the first query locally, and extract the information from this local object?

I am not very familiar with Python, so unfortunately, I can't help with code.

written 7 months ago by h.mon

7 months ago, h.mon (Brazil) wrote:

Using EUtils / EDirect:

term="spiroplasma AND male killing"
esearch -db pubmed -query "$term" \
  | efetch -format xml \
  | xtract -pattern PubmedArticle -PMID MedlineCitation/PMID \
    -block ArticleId -if @IdType -equals doi \
    -encode "&PMID",ArticleId

This is a modified version of the query found at Entrez Direct: E-utilities on the UNIX Command Line. Note that not all PMIDs have a DOI; those that lack one are skipped.
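For readers who would rather stay in Python, the same idea (fetch the records in bulk, then read the DOI out of each one) can be sketched roughly as below. This is an illustration, not the method used by metapub or Biopython: it calls the public efetch endpoint through the standard library, and `parse_dois`, `fetch_dois`, and the `batch_size` default are names and values assumed for the example. It reads only the article's own `ArticleIdList`, so DOIs of cited references are not picked up.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_dois(xml_data):
    """Map each PMID in a PubmedArticleSet document to its DOI (or None)."""
    dois = {}
    root = ET.fromstring(xml_data)
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        doi = None
        # Look only at the article's own IDs, not those of cited references.
        for article_id in article.findall("PubmedData/ArticleIdList/ArticleId"):
            if article_id.get("IdType") == "doi":
                doi = (article_id.text or "").strip() or None
                break
        dois[pmid] = doi
    return dois


def fetch_dois(pmids, email, batch_size=200):
    """Fetch DOIs for many PMIDs with one efetch request per batch.

    batch_size is an assumed, conservative value for this sketch.
    """
    pmids = [str(p) for p in pmids]
    dois = {}
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        params = urllib.parse.urlencode({
            "db": "pubmed",
            "id": ",".join(batch),
            "retmode": "xml",
            "email": email,
        }).encode()
        # Sending the parameters as POST data keeps long ID lists out of the URL.
        with urllib.request.urlopen(EFETCH_URL, data=params) as response:
            dois.update(parse_dois(response.read()))
    return dois
```

Unlike the EDirect pipeline, this sketch keeps PMIDs without a DOI and maps them to `None`. A call would look like `fetch_dois(idlist, email="you@example.org")`, where the address is a placeholder for your own.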

written 7 months ago by h.mon

Thanks! This helps a lot.

written 7 months ago by n.anuragsharma

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

written 7 months ago by GenoMax

You are welcome. It seems you are also considering R solutions; have you considered the rentrez package?

written 7 months ago by h.mon

This is excellent! Yes I primarily use R but somehow had not come across rentrez, and while looking up its vignette I accidentally stumbled across RISmed. Much obliged, thanks!

written 7 months ago by n.anuragsharma