Biopython's Esearch for Pubmed does not give the same results as web search
3.6 years ago
yz • 0

I am currently using Biopython's Esearch to get the list of papers for a search term. Unfortunately, the results differ from those of the web search. I have already tried the sort parameter, but it does not help. The total number of search results also differs.

For example, I use the following code to search for "sclerosis":

from Bio import Entrez
def search(query):
    Entrez.email = 'example@mail.com'
    handle = Entrez.esearch(db='pubmed',
                            sort='pub date',
                            retmax='10',
                            retmode='xml',
                            term=query)
    results = Entrez.read(handle)
    print(results['Count'])
    return results

def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'example@mail.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results

if __name__ == '__main__':
    results = search('sclerosis')
    id_list = results['IdList']
    papers = fetch_details(id_list)
    for i, paper in enumerate(papers['PubmedArticle']):
        print("%d) %s" % (i + 1, paper['MedlineCitation']['Article']['ArticleTitle']))

Output:

  1. Therapeutic potential of neuromodulation for demyelinating diseases.
  2. Astaxanthin Reduces Demyelination and Oligodendrocytes Death in A Rat Model of Multiple Sclerosis.
  3. AI-Based Methods and Technologies to Develop Wearable Devices for Prosthetics and Predictions of Degenerative Diseases.
  4. RNA Editing in Neurological and Neurodegenerative Disorders.
  5. Neuromuscular junction mitochondrial enrichment: a "double-edged sword" underlying the selective motor neuron vulnerability in amyotrophic lateral sclerosis.
  6. Fused in sarcoma-amyotrophic lateral sclerosis as a novel member of DNA single strand break diseases with pure neurological phenotypes.
  7. Mending the broken in amyotrophic lateral sclerosis: DNA damage and repair in motor neuron degeneration.
  8. Cognitive impairment in multiple sclerosis: lessons from cerebrospinal fluid biomarkers.
  9. Reorganization of multiple sclerosis health care system in Clinical Centre of Montenegro during the COVID-19 pandemic.
  10. Mélange intéressante: COVID-19, autologous transplants and multiple sclerosis.

No matter how I sort the results, on the web search or in the code above, I do not get the same list.

biopython pubmed python3.x Esearch python • 2.4k views

I am not sure there is anything you can do about that. A search for "sclerosis" via the web brings up 162,112 hits (as of now), but the same search via EntrezDirect returns 161,858 hits. It is possible that the database searched by the webpage is newer.

$ esearch -db pubmed -query "sclerosis"
<ENTREZ_DIRECT>
  <Db>pubmed</Db>
  <WebEnv>MCID_5f526a1a84f12c7b3075dbe0</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>161858</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

What is the ultimate aim of your search? You may be able to achieve what you need with the right combination of search terms.


That is exactly my problem. The goal is to use LDA topic modeling to identify latent topics, so ideally I would get all available papers. I had the same suspicion about how current the data is, but I have not yet found a way to confirm it.
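
One possible way to collect more than the first page of IDs is to page through the result set with the E-utilities `retstart` and `retmax` parameters, which `Entrez.esearch` forwards to the server. This is only a minimal sketch, assuming Biopython is installed. The names `page_offsets`, `entrez_search_page`, and `collect_ids` are hypothetical helpers, not part of Biopython, and the `search_page` argument exists only so the paging logic can be tried without network access.

```python
def page_offsets(total, page_size):
    """Return the retstart offsets needed to cover `total` records."""
    return list(range(0, total, page_size))


def entrez_search_page(query, retstart, retmax):
    """Fetch one page of PMIDs from PubMed (needs network and Biopython)."""
    from Bio import Entrez  # imported lazily so the other helpers work without it
    Entrez.email = 'example@mail.com'
    handle = Entrez.esearch(db='pubmed', term=query,
                            retstart=retstart, retmax=retmax)
    results = Entrez.read(handle)
    handle.close()
    return int(results['Count']), list(results['IdList'])


def collect_ids(query, page_size=1000, search_page=entrez_search_page):
    """Collect PMIDs page by page until the reported count is covered."""
    total, ids = search_page(query, 0, page_size)
    for start in page_offsets(total, page_size)[1:]:
        _, page = search_page(query, start, page_size)
        if not page:  # the server stopped returning IDs; stop early
            break
        ids.extend(page)
    return ids
```

Calling `collect_ids('sclerosis')` would then return a flat list of PMIDs that can be passed to `fetch_details` in batches. The early `break` guards against the server returning fewer IDs than the reported count suggests.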

2.8 years ago
lbehringer • 0

I can think of two factors that might cause different results between Biopython and the web search:

  1. Depending on how specific the query you give Biopython is, it will be translated before retrieving results. Example: <sclerosis> will be translated to <"sclerosis"[MeSH Terms] OR "sclerosis"[All Fields]>
  2. As GenoMax pointed out, the database version that Biopython is using might be older than that of the webpage.

You can find out what your query is translated to as well as the database build and last update as follows:

from Bio import Entrez

def search(query):
    Entrez.email = 'example@mail.com'
    handle = Entrez.esearch(db='pubmed',
                            sort='pub date',
                            retmax='10',
                            retmode='xml',
                            term=query)
    results = Entrez.read(handle)
    print('Count: ' + results['Count'])
    print('QueryTranslation: ' + results['QueryTranslation'])
    return results

def get_info(db):
    Entrez.email = 'example@mail.com'
    handle = Entrez.einfo(db=db)
    results = Entrez.read(handle)
    print('DbBuild: ' + results['DbInfo']['DbBuild'])
    print('LastUpdate: ' + results['DbInfo']['LastUpdate'])
    return results['DbInfo']

def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'example@mail.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results

if __name__ == '__main__':
    query = 'sclerosis'
    results = search(query)
    db_info = get_info('pubmed')
    id_list = results['IdList']
    papers = fetch_details(id_list)
    for i, paper in enumerate(papers['PubmedArticle']):
        print("%d) %s" % (i + 1, paper['MedlineCitation']['Article']['ArticleTitle']))

Output:

Count: 170232
QueryTranslation: "sclerosis"[MeSH Terms] OR "sclerosis"[All Fields]
DbBuild: Build210622-2217m.2
LastUpdate: 2021/06/23 06:55
1) Fibrosis as a common trait in amyotrophic lateral sclerosis tissues.
2) Lower and upper motor neuron involvement and their impact on disease prognosis in amyotrophic lateral sclerosis.
3) Predictive value of sub classification of focal segmental glomerular sclerosis in Oxford classification of IgA nephropathy.
4) Bushen Yijing Decoction (BSYJ) exerts an anti-systemic sclerosis effect via regulating MicroRNA-26a /FLI1 axis.
5) Hodgkin lymphoma involving extranodal sites in head and neck: report of twenty-nine cases and review of three-hundred and fifty-seven cases.
6) Galangin ameliorates experimental autoimmune encephalomyelitis in mice via modulation of cellular immunity.
7) 11C-PK11195 plasma metabolization has the same rate in multiple sclerosis patients and healthy controls: a cross-sectional study.
8) Multiple sclerosis: why we should focus on both sides of the (auto)antibody.
9) Teriflunomide provides protective properties after oxygen-glucose-deprivation in hippocampal and cerebellar slice cultures.
10) Neuroimmune connections between corticotropin-releasing hormone and mast cells: novel strategies for the treatment of neurodegenerative diseases.

Comparing the Biopython and web search results for the translated query, I get 170,232 vs. 170,426 results. The top 10 results are the same, albeit in a slightly different order.
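
A comparison like this can be made less error-prone by checking the two PMID lists programmatically. The sketch below is plain Python and assumes the web search IDs have been copied into a list by hand. `compare_id_lists` is a hypothetical helper name, not a Biopython function.

```python
def compare_id_lists(api_ids, web_ids):
    """Summarize how two ordered PMID lists differ."""
    api_set, web_set = set(api_ids), set(web_ids)
    return {
        # True when both lists contain exactly the same PMIDs
        'same_ids': api_set == web_set,
        # True only when the PMIDs also appear in the same order
        'same_order': list(api_ids) == list(web_ids),
        # PMIDs missing from the other list, in their original order
        'only_in_api': [i for i in api_ids if i not in web_set],
        'only_in_web': [i for i in web_ids if i not in api_set],
    }
```

For the situation described above (same top 10, different order), `same_ids` would be `True` and `same_order` would be `False`.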


To follow up on this: EntrezDirect seems to do an "All Fields" search by default.

$ esearch -db pubmed -query "sclerosis"
<ENTREZ_DIRECT>
  <Db>pubmed</Db>
  <WebEnv>MCID_60d46b608b0f250da23005bd</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>170232</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

$ esearch -db pubmed -query "sclerosis [All Fields]"
<ENTREZ_DIRECT>
  <Db>pubmed</Db>
  <WebEnv>MCID_60d46b4441c0016e924a7a51</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>170232</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

$ esearch -db pubmed -query "sclerosis [MeSH Terms]"
<ENTREZ_DIRECT>
  <Db>pubmed</Db>
  <WebEnv>MCID_60d46b4d1849d92f73526f2f</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>8940</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

$ esearch -db pubmed -query "sclerosis [MeSH Terms] OR sclerosis [All Fields]"
<ENTREZ_DIRECT>
  <Db>pubmed</Db>
  <WebEnv>MCID_60d46c899d99b66b2851c4ee</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>170232</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
2.5 years ago
Lilia • 0

This update in April 2022 might help https://ncbiinsights.ncbi.nlm.nih.gov/2021/10/05/updated-pubmed-api/


NCBI gives something (esearch will now return exactly the same PubMed IDs (PMIDs) that are currently returned by web PubMed) but also takes important functionality away (only the first 10,000 results will be accessible). I don't know if that counts as help.
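
One possible workaround for the 10,000-record cap is to split a large query into smaller date windows and merge the IDs. This is only a sketch under assumptions: `mindate`, `maxdate`, and `datetype` are E-utilities ESearch parameters that Biopython forwards as keyword arguments. `year_windows`, `search_window`, and `search_by_year` are hypothetical helper names. A single year can still exceed the cap for a broad term, and this sketch does not subdivide further.

```python
def year_windows(first_year, last_year):
    """Return (mindate, maxdate) strings, one pair per calendar year."""
    return [(f'{year}/01/01', f'{year}/12/31')
            for year in range(first_year, last_year + 1)]


def search_window(query, mindate, maxdate, retmax=10000):
    """Run one date-restricted PubMed search (needs network and Biopython)."""
    from Bio import Entrez  # imported lazily so the other helpers work without it
    Entrez.email = 'example@mail.com'
    handle = Entrez.esearch(db='pubmed', term=query, retmax=retmax,
                            datetype='pdat', mindate=mindate, maxdate=maxdate)
    results = Entrez.read(handle)
    handle.close()
    return list(results['IdList'])


def search_by_year(query, first_year, last_year, search=search_window):
    """Collect unique PMIDs across yearly windows, keeping first-seen order."""
    seen = {}
    for mindate, maxdate in year_windows(first_year, last_year):
        for pmid in search(query, mindate, maxdate):
            seen.setdefault(pmid, None)  # dict keys preserve insertion order
    return list(seen)
```

A call such as `search_by_year('sclerosis', 2019, 2021)` would issue one request per year. The `search` argument is there so the merging logic can be checked with a stand-in function instead of a live request.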
