Question

Infer publication date by Pubmed ID

7.4 years ago

Giovanni M Dall'Olio 28k

I have a table of several thousands of pubmed ids, and I wonder if there is a smart way to infer the publication date for each of them.

My first thought was to look for a table with a column of PubMed IDs alongside the publication date. However, since PubMed IDs are assigned sequentially, I wonder if it would be enough to get the min/max PMID for every year and infer the publication year by finding the interval a given PMID falls into.

Has anyone faced a similar problem? Which database would you use for this?

pmid year pubmed • 3.1k views

7.2 years ago by Giovanni M Dall'Olio 28k

"since the pubmed ids are associated sequentially, I wonder if it would be enough to just get the min/max pmid for every year, and infer the publication date by looking for the correct interval."

that's not always the case...

pmid: 13054692 https://www.ncbi.nlm.nih.gov/pubmed/13054692 (1953)
pmid: 12054692 https://www.ncbi.nlm.nih.gov/pubmed/12054692 (2002)

7.4 years ago by Pierre Lindenbaum 161k

I see! That's too bad, it means I really need to get a table then.

7.4 years ago by Giovanni M Dall'Olio 28k

score 2 · Answer 1 · 2017-01-26

Thanks everybody.

Just for reference, I've decided to download the whole of MEDLINE and parse the files locally, as I wanted to avoid making hundreds of thousands of queries.

First, I've downloaded the files from ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline

Since these are xml files, I've extracted the publication date using this XSLT template:




<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="text" encoding="UTF-8"/>

    <xsl:template match="PubmedArticle">
        <xsl:value-of select="MedlineCitation/PMID"/>,<xsl:value-of select="MedlineCitation/Article/Journal/JournalIssue/PubDate/Year"/>,<xsl:value-of select="MedlineCitation/DateCompleted/Year"/>
    </xsl:template>

</xsl:stylesheet>

Then I transformed all the XML files using GNU parallel and xsltproc. This produced a number of txt files with three columns (PMID, publication year from the journal issue, and the year the record was completed), which were merged and formatted with an R script to get a final two-column file with PMID and year.
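For anyone who would rather do the extraction and merge in one step, here is a rough Python sketch of the same idea. It is not the script used above: the function name `extract_pmid_years`, the made-up sample records, and the fallback to DateCompleted when PubDate has no Year are all assumptions. Only the element paths are taken from the XSLT template.

```python
import io
import xml.etree.ElementTree as ET


def extract_pmid_years(xml_source):
    """Yield (pmid, year) for each PubmedArticle in a file path or file-like object."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        pmid = elem.findtext("MedlineCitation/PMID")
        pub_year = elem.findtext(
            "MedlineCitation/Article/Journal/JournalIssue/PubDate/Year"
        )
        completed_year = elem.findtext("MedlineCitation/DateCompleted/Year")
        # Assumption: when PubDate has no Year element, use the DateCompleted year.
        yield pmid, pub_year or completed_year
        # Free the memory held by this record before moving on.
        elem.clear()


# Made-up sample records for illustration only (not real MEDLINE data).
SAMPLE = b"""<PubmedArticleSet>
<PubmedArticle><MedlineCitation><PMID>1</PMID>
<DateCompleted><Year>1976</Year></DateCompleted>
<Article><Journal><JournalIssue><PubDate><Year>1975</Year></PubDate></JournalIssue></Journal></Article>
</MedlineCitation></PubmedArticle>
<PubmedArticle><MedlineCitation><PMID>2</PMID>
<DateCompleted><Year>1976</Year></DateCompleted>
<Article><Journal><JournalIssue><PubDate><MedlineDate>1975 Jun-Jul</MedlineDate></PubDate></JournalIssue></Journal></Article>
</MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

rows = list(extract_pmid_years(io.BytesIO(SAMPLE)))
print(rows)
```

For the compressed baseline files, passing `gzip.open(path, "rb")` instead of the `io.BytesIO` object should work, since `iterparse` accepts any binary file object.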

Interesting fact: I can now officially demonstrate that the PMID does not reliably track the publication date, e.g. two papers with consecutive PMIDs may have been published in completely different years.
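One way to check this on a (PMID, year) table is to sort by PMID and count how often the year goes backwards. The sketch below is only an illustration: the function name `count_year_inversions` is hypothetical, and the two records are the PMIDs quoted earlier in this thread.

```python
def count_year_inversions(records):
    """Count neighbouring pairs, sorted by PMID, where the higher PMID has an earlier year."""
    ordered = sorted(records)
    return sum(
        1
        for (_, earlier_year), (_, later_year) in zip(ordered, ordered[1:])
        if later_year < earlier_year
    )


# The two records quoted in the comments above: (pmid, year).
records = [(13054692, 1953), (12054692, 2002)]
print(count_year_inversions(records))
```

Any count above zero means that per-year min/max PMID intervals overlap, so the interval lookup cannot be trusted.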

score 1 · Answer 2 · 2017-01-15

I pieced this script together; I hope it helps in some way.

library(RISmed)   # EUtilsSummary(), QueryId()
library(rentrez)  # entrez_fetch()
library(XML)      # xpathApply(), xmlValue()

search_topic <- ' '  # specify your query
# Set the time window as needed.
search_query <- EUtilsSummary(search_topic, retmax = 100, mindate = 2010, maxdate = 2016)

# PubMed IDs matching the query
your.ids <- QueryId(search_query)

# rentrez function to get the data from the PubMed db
fetch.pubmed <- entrez_fetch(db = "pubmed", id = your.ids,
                             rettype = "xml", parsed = TRUE)

# Extract the abstracts for the respective IDs.
abstracts <- xpathApply(fetch.pubmed, '//PubmedArticle//Article',
                        function(x) xmlValue(xmlChildren(x)$Abstract))

# Name each abstract after its ID.
names(abstracts) <- your.ids
abstracts

# Assumption: col.abstracts was never defined in the post; here it is built
# as a one-column data frame so that dim() and write.csv() have something to use.
col.abstracts <- data.frame(abstract = unlist(abstracts))
dim(col.abstracts)
write.csv(col.abstracts, file = "abs.csv")

score 0 · Answer 3 · 2016-12-14

7.4 years ago

bongok ▴ 40

Have you tried E-utilities? https://www.ncbi.nlm.nih.gov/books/NBK25497/

Fetch the records for your PubMed IDs in XML format and write a script to parse out the date. https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch
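As a rough sketch of what that could look like, the Python below only builds the EFetch request URLs and splits a long ID list into batches; it does not send anything. The helper names `batched` and `efetch_url` are hypothetical and the batch size is an arbitrary assumption. The `db`, `id` and `retmode` parameters are the ones described on the EFetch page linked above.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batched(ids, size):
    """Split a list of IDs into consecutive chunks of at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def efetch_url(pmids):
    """Build an EFetch URL that requests the given PMIDs from PubMed as XML."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(pmid) for pmid in pmids),
        "retmode": "xml",
    }
    return EFETCH_URL + "?" + urlencode(params)


pmids = [12054692, 13054692]
# Assumption: 200 IDs per request; adjust to whatever NCBI currently recommends.
urls = [efetch_url(chunk) for chunk in batched(pmids, 200)]
print(urls[0])
```

Each response is a PubmedArticleSet document, so the year can be read from the same PubDate/Year path that the XSLT template in the accepted answer uses.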


That would be a possibility, but I need to do it for basically all the papers ever published! It would be a bit of an overkill, especially considering that with a SQL query I could simply calculate the max and min PMID per year. How would you structure such a query with the E-utilities?

7.4 years ago by Giovanni M Dall'Olio 28k
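The SQL side of that idea might look like the sketch below, which uses an in-memory SQLite database from Python. The table name `articles` and its columns are assumptions, and the only rows loaded are the two PMIDs quoted earlier in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: a local table holding one row per article.
conn.execute("CREATE TABLE articles (pmid INTEGER PRIMARY KEY, year INTEGER)")
conn.executemany(
    "INSERT INTO articles (pmid, year) VALUES (?, ?)",
    [(12054692, 2002), (13054692, 1953)],
)

# Smallest and largest PMID seen for each publication year.
ranges = conn.execute(
    "SELECT year, MIN(pmid), MAX(pmid), COUNT(*) "
    "FROM articles GROUP BY year ORDER BY year"
).fetchall()

for year, min_pmid, max_pmid, n_articles in ranges:
    print(year, min_pmid, max_pmid, n_articles)
```

Even with just these two rows, the 1953 range sits above the 2002 range, which is why the per-year intervals cannot be used on their own to date a PMID.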

score 0 · Answer 4 · 2016-12-14

7.3 years ago

WouterDeCoster 47k

It would require many requests to Entrez, but you could probably do this with Biopython (not exactly a 'smart' way, but it would work).
