Question

Looking for a faster way to get DOIs of PubMed publications

1

5.1 years ago

n.anuragsharma ▴ 40

Is there a faster way to get PubMed DOIs from an input list of PMIDs?

Alternatively, is there a way to do all of this in R/Bioconductor?

import metapub
from Bio import Entrez, Medline

# Function to get the number of publications matching a query term
def pubmed_publications_count(term):
    Entrez.email = "n.anuragsharma@gmail.com"  # My email ID
    handle = Entrez.esearch(db="pubmed", term=term, rettype='count')
    record = Entrez.read(handle)
    handle.close()
    count = record["Count"]
    return count

# Function to get all the PMIDs matching a query term
def pubmed_publications_ids(term):
    count = pubmed_publications_count(term)
    Entrez.email = "n.anuragsharma@gmail.com"  # My email ID
    handle = Entrez.esearch(db="pubmed", term=term, retmax=count)
    record = Entrez.read(handle)
    handle.close()
    idlist = record["IdList"]
    return idlist

When I'm working with a small number of queries the wait time seems alright:

term = "Arabidopsis hypotcotyl PIF dark root"
idlist = pubmed_publications_ids(term) # returns 3 PMIDs
doi = [metapub.FindIt(Id).doi for Id in idlist]

But with more than 10 or so PMIDs, as in this example, it takes forever to get what I need:

term = "Arabidopsis hypotcotyl PIF dark"
idlist = pubmed_publications_ids(term) # returns 44 PMIDs
doi = [metapub.FindIt(Id).doi for Id in idlist]

What I've tried so far, all of which throw an error:

# Passing the PMIDs as a comma-separated string, as indicated in "The E-utilities In-Depth: Parameters, Syntax and More":
term = "Arabidopsis hypotcotyl PIF dark root"
idlist = pubmed_publications_ids(term)
idlist = ",".join(idlist)
doi = metapub.FindIt(idlist)
# ValueError: invalid literal for int() with base 10: '28970478,25763615,24279300'

# Passing the PMIDs as a semicolon-separated string, as indicated in an SO comment by @Maximilian Peters:
term = "Arabidopsis hypotcotyl PIF dark root"
idlist = pubmed_publications_ids(term)
idlist = ";".join(idlist)
doi = metapub.FindIt(idlist)
# ValueError: invalid literal for int() with base 10: '28970478;25763615;24279300'

# I can't even get it to work with the input as a list:
term = "Arabidopsis hypotcotyl PIF dark root"
idlist = pubmed_publications_ids(term)
doi = metapub.FindIt(idlist)
# TypeError: int() argument must be a string, a bytes-like object or a number, not 'ListElement'

Is there any way to avoid writing a loop and submit a whole list of PMIDs instead? Alternatively, any other package with clearer documentation would help. If I have missed any existing documentation for this problem, I apologise in advance!

Edit

From the maintainer of Metapub:

One thing is that you don't want to be using FindIt just for doi lookups -- FindIt is slow because it goes looking for the PDFs of papers as well!

You should try this function:

from metapub.text_mining import pmid2doi

... pmid2doi is (currently) the fastest way to accurately get a DOI for a PMID.

And as she said, my code does indeed run a lot faster with it. I'm unable to wrap my head around how you time a function in Python in an IDE, so I haven't added any timings here.
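For anyone wanting to put numbers on the difference, here is a minimal timing sketch using only the standard library. `time_call` and `fake_lookup` are hypothetical names made up for illustration; `fake_lookup` is a stand-in so the example runs without network access, and is not part of metapub.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_lookup(pmid):
    """Hypothetical stand-in for a DOI lookup; sleeps to mimic a slow call."""
    time.sleep(0.01)
    return f"10.0000/example.{pmid}"


def lookup_all(pmids):
    """Look up every PMID one at a time, as the list comprehension above does."""
    return [fake_lookup(pmid) for pmid in pmids]


dois, seconds = time_call(lookup_all, ["1", "2", "3"])
print(f"{len(dois)} lookups took {seconds:.3f} s")
```

Swapping `fake_lookup` for the FindIt-based lookup or for `pmid2doi` inside `lookup_all` should give a rough side-by-side comparison, although network latency will make individual runs noisy.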

Also, there is a similar package for R, easyPubMed; however, I find its options rather limited at the moment.

metapub Medline python pubmed Bio • 4.2k views

ADD COMMENT • link updated 3.7 years ago by xiaopeng990 • 0 • written 5.1 years ago by n.anuragsharma ▴ 40

1

NCBI publishes annual PubMed data dumps (and incremental updates) on their FTP site. You could look at getting that data and parsing it locally. That should be much faster, but the data may lag behind the most current records.

You could also look at using Entrez Direct (EDirect) instead of Python.

ADD REPLY • link 5.1 years ago by GenoMax 152k

0

I'll check them out, thanks.

ADD REPLY • link 5.1 years ago by n.anuragsharma ▴ 40

0

Aren't you performing several queries: first one query with "term", and then one more query for each PubMed ID? Of course this gets very slow as the number of records grows. Can't you store the full result of the first query locally, and extract the information from this local object?

I am not very familiar with Python, so unfortunately, I can't help with code.

ADD REPLY • link 5.1 years ago by h.mon 35k
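The suggestion above (fetch the records once, then extract locally) might look roughly like this in Python. This is a sketch using only the standard library, not metapub's or Biopython's own method. `fetch_pubmed_xml` and `extract_dois` are hypothetical helper names, and the sample XML is a made-up, heavily trimmed record that assumes the usual PubMed XML layout.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fetch_pubmed_xml(pmids, email):
    """Fetch the PubMed XML for all PMIDs in one request (needs network access)."""
    params = urllib.parse.urlencode({
        "db": "pubmed",
        "id": ",".join(str(pmid) for pmid in pmids),
        "retmode": "xml",
        "email": email,
    }).encode()
    # Sending the parameters as POST data keeps long ID lists out of the URL.
    with urllib.request.urlopen(EFETCH_URL, data=params) as response:
        return response.read()


def extract_dois(xml_data):
    """Map each PMID to its DOI, or to None when the record lists no DOI."""
    dois = {}
    for article in ET.fromstring(xml_data).iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        # Look only at the article's own ID list, not at IDs of cited references.
        doi = article.findtext("PubmedData/ArticleIdList/ArticleId[@IdType='doi']")
        dois[pmid] = doi
    return dois


# Made-up sample so the parsing step can be tried offline.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID Version="1">111</PMID></MedlineCitation>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="pubmed">111</ArticleId>
      <ArticleId IdType="doi">10.0000/example.111</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><PMID Version="1">222</PMID></MedlineCitation>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="pubmed">222</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""

print(extract_dois(SAMPLE))
# {'111': '10.0000/example.111', '222': None}
```

With a real ID list, `extract_dois(fetch_pubmed_xml(idlist, "you@example.org"))` would replace the per-PMID loop with a single request. Very large lists would probably need to be split into batches.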

0

The current way to call pmid2doi in metapub is shown below, though it is still not that efficient.

from metapub.convert import pmid2doi

dois = [pmid2doi(pmid) for pmid in df_papers['pmid']]

ADD REPLY • link 3.7 years ago by xiaopeng990 • 0

score 3 · Accepted Answer · 2020-06-05

3

5.1 years ago

h.mon 35k

Using EUtils / EDirect:

term="spiroplasma AND male killing"
esearch -db pubmed -query "$term" \
  | efetch -format xml \
  | xtract -pattern PubmedArticle -PMID MedlineCitation/PMID \
    -block ArticleId -if @IdType -equals doi \
    -encode "&PMID",ArticleId

This is a modified version of the query found at Entrez Direct: E-utilities on the UNIX Command Line. Note that not all PMIDs have a DOI; those that do not are skipped.

ADD COMMENT • link 5.1 years ago by h.mon 35k

0

Thanks! This helps a lot.

ADD REPLY • link 5.1 years ago by n.anuragsharma ▴ 40

0

You are welcome. It seems you are also considering R solutions; have you considered the rentrez package?

ADD REPLY • link 5.1 years ago by h.mon 35k

0

This is excellent! Yes, I primarily use R but somehow had not come across rentrez, and while looking up its vignette I also stumbled across RISmed. Much obliged, thanks!

ADD REPLY • link 5.1 years ago by n.anuragsharma ▴ 40