How do I get full text AND MeSH terms from Entrez on the PMC database in biopython?
6.0 years ago
Syncrossus ▴ 10

I need to get full-text articles as well as their MeSH terms from PubMed Central using Biopython's implementation of the E-utilities. So far, I have:

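from Bio import Entrez

# Search PMC and keep the results on the history server for later efetch calls
search_results = Entrez.read(Entrez.esearch(db="pmc",
                                            term=search_query,
                                            retmax=10,
                                            usehistory="y"))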

My search query is such that I get only open access MEDLINE articles about some subject from the PubMed Central database. When I download articles, I use efetch like so:

handle = Entrez.efetch(db="pmc",
                       rettype="full",
                       retmode="xml",
                       retstart=start,
                       retmax=batch_size,  # avoid shadowing the built-in max()
                       webenv=search_results["WebEnv"],
                       query_key=search_results["QueryKey"])
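
To walk through the whole result set with the history server, the efetch call above can be repeated with an increasing `retstart`. The following is only a minimal sketch, assuming Biopython is installed and `Entrez.email` has been set; `batch_starts` and `fetch_all` are hypothetical helper names, and the batch size of 100 is an arbitrary choice.

```python
def batch_starts(total, batch_size):
    """Return the retstart offsets needed to cover `total` records."""
    return list(range(0, total, batch_size))


def fetch_all(search_results, batch_size=100):
    """Yield the raw XML of each batch of a history-server search (needs network)."""
    from Bio import Entrez  # imported here so batch_starts works without Biopython

    # "Count" is the total number of hits reported by esearch
    total = int(search_results["Count"])
    for start in batch_starts(total, batch_size):
        handle = Entrez.efetch(db="pmc",
                               retmode="xml",
                               retstart=start,
                               retmax=batch_size,
                               webenv=search_results["WebEnv"],
                               query_key=search_results["QueryKey"])
        try:
            yield handle.read()
        finally:
            handle.close()
```

Each yielded batch could then be written to disk or parsed directly, e.g. `for xml_batch in fetch_all(search_results): ...`.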

In my experience, the only way to get the full text is with retmode="xml"; switching between rettype="full" and rettype="medline" doesn't seem to change much. My problem is that I can't get MeSH terms with these settings, and I can't get the full text with any other settings. Am I missing something? Are MeSH terms not in a <MeshHeadingList> tag? Do PMC's open access articles not have MeSH terms associated with them?

python biopython pubmed entrez e-utilities • 6.5k views

First search with db="pubmed", get the MeSH terms and extract the PMC identifier from each record, then download the PMC article with another efetch.
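
As a rough illustration of the "extract the PMC identifier" step, here is a sketch that reads the PMC ID out of PubMed XML with the standard library. It assumes the record carries an `ArticleId` element with `IdType="pmc"` under `PubmedData/ArticleIdList`; the sample XML is a made-up, trimmed fragment and `pmc_ids_by_pmid` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed PubMed efetch output used only for illustration
SAMPLE_PUBMED_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">12345</ArticleId>
        <ArticleId IdType="pmc">PMC67890</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""


def pmc_ids_by_pmid(pubmed_xml):
    """Map each PMID to its PMC identifier, or None if the record has none."""
    root = ET.fromstring(pubmed_xml)
    mapping = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        pmcid = article.findtext(
            "PubmedData/ArticleIdList/ArticleId[@IdType='pmc']")
        mapping[pmid] = pmcid
    return mapping


print(pmc_ids_by_pmid(SAMPLE_PUBMED_XML))
```

Records that map to `None` have no PMC copy, so only the remaining identifiers would be passed on to an efetch with db="pmc".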


I'm not certain I follow. I should use esearch() with db="pubmed" and then call efetch twice, once with db="pubmed" and once with db="pmc"? Does that mean that the same articles hosted on different databases have different metadata? Why on earth would that be the case? Furthermore, how do I limit my search to PMC's "open access" section on pubmed?


Let's check the PubMed DTD:

$ wget -q -O - "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_180101.dtd" | grep -i mesh
(...)
   !ELEMENT MeshHeading (DescriptorName, QualifierName*)
   !ELEMENT MeshHeadingList (MeshHeading+)
   !ELEMENT SupplMeshList (SupplMeshName+)
   !ELEMENT SupplMeshName (#PCDATA) 
   !ATTLIST SupplMeshName

How about the PMC DTD?

$ wget -q -O - "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd" | grep -i mesh

(nothing)

> Why on earth would that be the case?

Ask NCBI.

> Furthermore, how do I limit my search to PMC's "open access" section on pubmed?

Unless I'm wrong, PMC is a "free full-text archive".


Thank you for all of your help.

> unless I'm wrong, PMC is a "free full-text archive"

For my purposes, I need to use the Open Access Subset:

> The articles in the [Open Access] Subset are made available under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work. [...] The majority of the articles in PMC are subject to traditional copyright restrictions and are not part of this subset.
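
For anyone wondering how such a restriction can be expressed in the search itself, here is a small sketch. It assumes PMC's `open access[filter]` search term is what limits results to this subset; `open_access_term` is a hypothetical helper and the topic string is just an example.

```python
def open_access_term(topic_query):
    """Wrap a topic query so it is combined with the PMC open access filter."""
    # Assumption: "open access[filter]" restricts a db="pmc" search
    # to the Open Access Subset.
    return f"({topic_query}) AND open access[filter]"


search_query = open_access_term("zika virus")
print(search_query)
```

The resulting string would be passed as `term=search_query` to `Entrez.esearch(db="pmc", ...)`.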

I'm trying the opposite: getting the PubMed record from the PMID in each PMC article. It seems to work on small batches, and I've found some MeSH terms, but I have 45k articles to go through, so we'll know for sure when that's done.

6.0 years ago
Syncrossus ▴ 10

It turns out PMC articles don't contain MeSH terms. What worked for me, after doing my search in the PMC database and downloading all the XML files, was extracting the PMID field in each XML file with pmid = tree.findall(".//article-id[@pub-id-type='pmid']"), where tree is an ElementTree built from the PMC XML. I then used these PMIDs to download the records from PubMed: handle = Entrez.efetch(db="pubmed", id=i, rettype="full", retmode="xml") for each i in my list of PMIDs. From the PubMed XML, parsed into a new tree, I could then extract the MeSH terms with meshtags = tree.findall(".//MeshHeadingList/MeshHeading/*"). All that's left at this point is a little processing on meshtags.
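
For readers who want to see the two parsing steps together, here is a minimal sketch using the standard library. The XML fragments are made up and heavily trimmed, and `extract_pmids` and `extract_mesh` are hypothetical helper names; grouping qualifiers under their descriptor is one possible way to do the "little processing", not necessarily what I did.

```python
import xml.etree.ElementTree as ET

# Made-up, trimmed fragments standing in for real efetch output
SAMPLE_PMC_XML = """<pmc-articleset><article><front><article-meta>
<article-id pub-id-type="pmid">12345</article-id>
<article-id pub-id-type="pmc">67890</article-id>
</article-meta></front></article></pmc-articleset>"""

SAMPLE_PUBMED_XML = """<PubmedArticleSet><PubmedArticle><MedlineCitation>
<PMID>12345</PMID>
<MeshHeadingList>
<MeshHeading>
<DescriptorName>Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName>Neoplasms</DescriptorName>
<QualifierName>genetics</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation></PubmedArticle></PubmedArticleSet>"""


def extract_pmids(pmc_xml):
    """Return the PMIDs found in PMC article XML."""
    root = ET.fromstring(pmc_xml)
    return [el.text
            for el in root.findall(".//article-id[@pub-id-type='pmid']")]


def extract_mesh(pubmed_xml):
    """Return {pmid: [(descriptor, [qualifiers]), ...]} from PubMed XML."""
    root = ET.fromstring(pubmed_xml)
    result = {}
    for citation in root.iter("MedlineCitation"):
        pmid = citation.findtext("PMID")
        headings = []
        for heading in citation.findall("MeshHeadingList/MeshHeading"):
            descriptor = heading.findtext("DescriptorName")
            qualifiers = [q.text for q in heading.findall("QualifierName")]
            headings.append((descriptor, qualifiers))
        result[pmid] = headings
    return result


print(extract_pmids(SAMPLE_PMC_XML))
print(extract_mesh(SAMPLE_PUBMED_XML))
```

With tens of thousands of PMIDs, it may also be worth passing several IDs per efetch call (Biopython accepts a list or a comma-separated string for `id`) rather than one request per article.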

Thanks to Pierre Lindenbaum for his helpful information, and to my colleague for pointing out the PMIDs.
