Question: How do I get full text AND MeSH terms from Entrez on the PMC database in biopython?
Syncrossus wrote (12 months ago):

I need to get full-text articles as well as their MeSH terms from PubMed Central using Biopython's implementation of the E-utilities. So far, I have:

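from Bio import Entrez

# search_query is defined elsewhere; usehistory="y" stores the hits on NCBI's history server
search_results = Entrez.read(Entrez.esearch(db="pmc",
                                            term=search_query,
                                            retmax=10,
                                            usehistory="y"))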

My search query is such that I get only open access MEDLINE articles about some subject from the PubMed Central database. When I download the articles, I use efetch like so:

handle = Entrez.efetch(db="pmc",
                       rettype="full",
                       retmode="xml",
                       retstart=start,
                       retmax=max,
                       webenv=search_results["WebEnv"],
                       query_key=search_results["QueryKey"])

In my experience, the only way to get the full text is with retmode="xml". Choosing rettype="full" or rettype="medline" doesn't seem to change much. My problem is that I can't seem to get MeSH terms with these settings, and I can't seem to get the full text with any other settings. Do you know if I'm missing something? Are MeSH terms not in a <MeshHeadingList> tag? Do PMC's open access articles not have MeSH terms associated with them?

modified 12 months ago • written 12 months ago by Syncrossus
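The paged efetch call in the question takes retstart and retmax values. A minimal sketch of how those pairs could be generated, assuming the total hit count is known (for example from the "Count" field of the esearch result); batch_offsets is a hypothetical helper name, not part of Biopython:

```python
def batch_offsets(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # The last batch may be smaller than batch_size.
        yield start, min(batch_size, total - start)


# Intended use (not executed here, needs network access and Biopython):
#   total = int(search_results["Count"])
#   for start, size in batch_offsets(total, 100):
#       handle = Entrez.efetch(db="pmc", retmode="xml",
#                              retstart=start, retmax=size,
#                              webenv=search_results["WebEnv"],
#                              query_key=search_results["QueryKey"])
```

Naming the loop variables start and size also avoids shadowing Python's built-in max.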

First search with db=pubmed and fetch the records to get the MeSH terms, extract the PMC identifier from each record, and then download the PMC article using another efetch.

written 12 months ago by Pierre Lindenbaum
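A rough sketch of the "extract the PMC identifier" step suggested above, assuming the PubMed efetch XML lists identifiers as ArticleId elements with an IdType attribute under PubmedData/ArticleIdList. The sample record and the helper name pmc_ids_by_pmid are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed PubMed record used only for illustration.
SAMPLE_PUBMED_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>111</PMID>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">111</ArticleId>
        <ArticleId IdType="pmc">PMC222</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""


def pmc_ids_by_pmid(pubmed_xml):
    """Map each PMID to its PMC identifier (None if the record has none)."""
    root = ET.fromstring(pubmed_xml)
    mapping = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        pmcid = article.findtext(
            "PubmedData/ArticleIdList/ArticleId[@IdType='pmc']")
        mapping[pmid] = pmcid
    return mapping


print(pmc_ids_by_pmid(SAMPLE_PUBMED_XML))
```

Records without a PMC identifier map to None, so they can be skipped before the second efetch against db="pmc".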

I'm not certain I follow. I should use esearch() with db="pubmed" and then call efetch twice, once with db="pubmed" and once with db="pmc"? Does that mean that the same articles hosted on different databases have different metadata? Why on earth would that be the case? Furthermore, how do I limit my search to PMC's "open access" section on pubmed?

written 12 months ago by Syncrossus

Let's check the PubMed DTD:

$ wget -q -O - "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_180101.dtd" | grep -i mesh
(...)
   <!ELEMENT MeshHeading (DescriptorName, QualifierName*)>
   <!ELEMENT MeshHeadingList (MeshHeading+)>
   <!ELEMENT SupplMeshList (SupplMeshName+)>
   <!ELEMENT SupplMeshName (#PCDATA)>
   <!ATTLIST SupplMeshName

How about the PMC DTD?

$ wget -q -O - "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd" | grep -i mesh

(nothing)

Why on earth would that be the case?

Ask NCBI.

Furthermore, how do I limit my search to PMC's "open access" section on pubmed?

unless I'm wrong, PMC is a "free full-text archive"

modified 12 months ago • written 12 months ago by Pierre Lindenbaum

Thank you for all of your help.

unless I'm wrong, PMC is a "free full-text archive"

For my purposes, I need to use the Open Access Subset:

The articles in the [Open Access] Subset are made available under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work. [...] The majority of the articles in PMC are subject to traditional copyright restrictions and are not part of this subset.

I'm trying the opposite: getting the PubMed record for each PMC article from the PMID in the PMC XML. It seems to work on small batches and I've found some MeSH terms, but I have 45k articles to go through, so we'll know for sure when that's done.

written 12 months ago by Syncrossus
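On restricting a PMC search to the Open Access Subset discussed above: as far as I know, PMC accepts the search filter open access[filter], but please check it against PMC's own documentation before relying on it. A small sketch of wrapping a query with it, where restrict_to_open_access is a hypothetical helper name:

```python
def restrict_to_open_access(term):
    """Wrap a PMC search term so only Open Access Subset hits are returned.

    Assumes PMC's `open access[filter]` search filter.
    """
    term = term.strip()
    if not term:
        raise ValueError("search term must not be empty")
    # Parentheses keep any OR inside the original term from escaping the AND.
    return f"({term}) AND open access[filter]"


search_query = restrict_to_open_access("asthma OR copd")
print(search_query)
```

The resulting string can be passed as term= to Entrez.esearch(db="pmc", ...) as in the question.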
Syncrossus wrote (12 months ago):

It turns out PMC articles don't contain MeSH terms. What worked for me, after doing my search in the PMC database and downloading all the XML files, was extracting the PMID field from each XML file with pmid = tree.findall(".//article-id[@pub-id-type='pmid']"), where tree is an ElementTree. I then used these PMIDs to download the corresponding records from PubMed: handle = Entrez.efetch(db="pubmed", id=i, rettype="full", retmode="xml") for each i in my list of PMIDs. I could then extract the MeSH terms with meshtags = tree.findall(".//MeshHeadingList/MeshHeading/*"). All that's left at this point is a little processing on meshtags.
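A minimal offline sketch of the PMID extraction step described above, using the same article-id lookup. The sample XML is a made-up, trimmed stand-in for a PMC efetch result, and extract_pmids is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed PMC article used only for illustration.
SAMPLE_PMC_XML = """<pmc-articleset>
  <article>
    <front>
      <article-meta>
        <article-id pub-id-type="pmc">222</article-id>
        <article-id pub-id-type="pmid">111</article-id>
      </article-meta>
    </front>
    <body><p>Full text here.</p></body>
  </article>
</pmc-articleset>"""


def extract_pmids(pmc_xml):
    """Return the PMID of every article in a PMC XML document that has one."""
    root = ET.fromstring(pmc_xml)
    pmids = []
    for article in root.iter("article"):
        node = article.find(
            "front/article-meta/article-id[@pub-id-type='pmid']")
        if node is not None and node.text:
            pmids.append(node.text.strip())
    return pmids


print(extract_pmids(SAMPLE_PMC_XML))
```

Articles without a PMID are skipped silently here; for a 45k-article run it may be worth logging them instead.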

Thanks to Pierre Lindenbaum for his helpful information, and to my colleague for pointing out the PMIDs.

modified 11 months ago • written 12 months ago by Syncrossus
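One possible shape for the "little processing on meshtags" mentioned in the answer: group each heading's DescriptorName with its QualifierName children instead of flattening them. This is a sketch on a made-up, trimmed PubMed record; mesh_terms is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed PubMed record used only for illustration.
SAMPLE_MESH_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>111</PMID>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName>Asthma</DescriptorName>
          <QualifierName>drug therapy</QualifierName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName>Humans</DescriptorName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def mesh_terms(pubmed_xml):
    """Map each PMID to a list of (descriptor, [qualifiers]) tuples."""
    root = ET.fromstring(pubmed_xml)
    result = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        terms = []
        for heading in article.iterfind(".//MeshHeadingList/MeshHeading"):
            descriptor = heading.findtext("DescriptorName")
            qualifiers = [q.text for q in heading.findall("QualifierName")]
            terms.append((descriptor, qualifiers))
        result[pmid] = terms
    return result


print(mesh_terms(SAMPLE_MESH_XML))
```

Keeping qualifiers attached to their descriptor preserves pairings such as "Asthma / drug therapy", which the flat MeshHeading/* lookup loses.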