
Tutorial: Using the easyPubMed and scholar packages to get all citations of your papers


16 months ago

rohitsatyam102 ▴ 920

This is a tutorial on downloading all the citations for the articles present in any Google Scholar Profile.

Use Case

Build the publication list for your CV, or help a friend with theirs.
Update your lab website with the latest publication list.
Add all your published papers to your thesis.

My objective was the second one, and I wanted to automate the process, so here is the code:

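library(scholar)
library(easyPubMed)
library(anytime)
library(plyr)    ## load plyr before dplyr so that dplyr's functions are not masked
library(dplyr)
## stringr must also be installed; it is called below as stringr::str_trim()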

## Get all the papers published by me using my Google Scholar ID
## (the value of "user=" in the profile URL, without the trailing "&hl")

scholar_id <- "Kz82pUgAAAAJ"
profile <- get_profile(scholar_id) ## profile details such as name and affiliation
publications <- get_publications(scholar_id, sortby = "year") ## get_publications() expects the ID string
publications <- publications[order(publications$year, decreasing = TRUE),] ## sorting by year
publications <- subset(publications, !(journal %in% 
c("BioRxiv","biorxiv","bioRxiv","","medRxiv","Authorea Preprints"))) ## removing preprints and entries with an empty journal field; feel free to extend the list if you use other preprint servers

## remove supplementary material and author corrections
publications <- publications[!grepl("Supplementa|Author Correction", publications$title),]

## removing duplicate papers using the paper titles
publications <- publications[!duplicated(publications$title),]
publications_top30 <- head(publications, 30) ## choosing the 30 most recent articles to report; head() avoids NA rows if there are fewer than 30

## We have most of the information, but we don't have publication dates. We need
## dates to sort publications based on them so that the latest paper shows up first (in reverse chronology).
## Let's obtain this information from PubMed

## cleanFun strips the HTML/XML tags from the string and converts what is left into a date
cleanFun <- function(htmlString) {
  return(anytime::anydate(stringr::str_trim(gsub("<.*?>", " ", htmlString))))
}

## I fetch dates and DOIs in two separate loops because I ran into errors otherwise; I will revise the code in the future to reduce runtime
dates <- lapply(1:nrow(publications_top30),function(x){
  req <- get_pubmed_ids_by_fulltitle(publications_top30[x,]$title, field = "[Title]")
  my_xml <- fetch_pubmed_data(req, retmax = 2)
  date <- cleanFun(custom_grep(my_xml, tag = "PubDate") %>% unlist)
  print(paste("fetched date for paper",x))
  return(date[1])
})

dois <- lapply(1:nrow(publications_top30),function(x){
  req <- get_pubmed_ids_by_fulltitle(publications_top30[x,]$title, field = "[Title]")
  my_xml <- fetch_pubmed_data(req, retmax = 2)
  if(is.null(my_xml)){
    doi<-NA
  } else{
    tt<-article_to_df(my_xml,max_chars = 18)
    doi <- unique(tt$doi)[1]
  }

  print(paste("fetched doi for paper",x))
  return(doi)
})

## adding the dates and dois to make the citation
publications_top30$date <- plyr::ldply(dates)$V1
publications_top30$doi <- plyr::ldply(dois)$V1

# uncomment it if you want to remove the papers without publication dates
#publications_top30 <- publications_top30[!(is.na(publications_top30$date)),]

## ordering articles by year and then date
publications_top30<-publications_top30[
  with(publications_top30, order(year, date,decreasing = TRUE)),
]

## Converting everything to strings
pub_list_temp <- lapply(1:nrow(publications_top30), function(x) {
  pub <- publications_top30[x,]
  paste(
    gsub("\\.\\.\\.", "et al.",pub$author),
    pub$title,
    pub$journal,
    pub$number,
    pub$year,
    pub$doi, sep = ", "
  )
})

df<- ldply(pub_list_temp)

write.table(df, file = "references.txt", sep = "\t", row.names = FALSE,quote = FALSE,
            col.names = FALSE)
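
The two loops above query PubMed twice per title. Below is a minimal sketch of how they could be merged into a single pass that does not abort when one lookup fails. The names `fetch_all`, `lookup`, and `fake_lookup` are hypothetical and not part of easyPubMed. In the real script, `lookup` would wrap the `get_pubmed_ids_by_fulltitle()`, `fetch_pubmed_data()`, `custom_grep()`, and `article_to_df()` calls used above; here a stand-in function is used so the sketch runs offline.

```r
# Hypothetical helper: call `lookup` once per title and keep going on errors.
# `lookup` is assumed to return a list with elements `date` and `doi`.
fetch_all <- function(titles, lookup) {
  rows <- lapply(titles, function(title) {
    # A failed lookup becomes NULL instead of stopping the whole loop
    res <- tryCatch(lookup(title), error = function(e) NULL)
    data.frame(
      title = title,
      date  = if (is.null(res)) as.Date(NA) else as.Date(res$date),
      doi   = if (is.null(res)) NA_character_ else as.character(res$doi),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Stand-in for the PubMed query (assumption: replace with real easyPubMed calls)
fake_lookup <- function(title) {
  if (title == "Paper B") stop("no PubMed match")
  list(date = "2023-05-01", doi = "10.1000/example")
}

result <- fetch_all(c("Paper A", "Paper B"), fake_lookup)
print(result)
```

Papers whose lookup fails end up with NA in both columns, so they can still be filtered out later with `is.na()`.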

The code still needs polishing. I used the paste function rather than functions from bibtex or RefManageR because I was not able to figure out how to convert a bibtex object to strings in R. If you have suggestions on how this code can be improved, feel free to chime in!!
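
One possible route for the string conversion, sketched with base R's utils package rather than bibtex or RefManageR: `utils::bibentry()` builds a citation object and `format(..., style = "text")` turns it into a plain character string. The `pub` row and the `format_pub` helper below are made up for illustration; the column names are assumed to match the `publications` data frame used in the script. Note that this sketch drops the trailing "..." from truncated author lists instead of writing "et al.".

```r
# Hypothetical row, shaped like one row of the publications data frame
pub <- data.frame(
  author  = "J Doe, A Smith, ...",
  title   = "A hypothetical title",
  journal = "Journal of Examples",
  year    = 2020,
  stringsAsFactors = FALSE
)

format_pub <- function(pub) {
  # Drop the trailing "..." used for truncated author lists
  authors <- sub(",?\\s*\\.\\.\\.$", "", pub$author)
  # BibTeX-style author string: names joined by " and "
  authors <- paste(strsplit(authors, ",\\s*")[[1]], collapse = " and ")
  entry <- utils::bibentry(
    bibtype = "Article",
    author  = utils::as.person(authors),
    title   = pub$title,
    journal = pub$journal,
    year    = as.character(pub$year)
  )
  # style = "text" gives a character string; collapse any line wraps
  gsub("\\s+", " ", format(entry, style = "text"))
}

citation_text <- format_pub(pub)
print(citation_text)
```

Applying `format_pub` to each row with `lapply()` would give a list of strings that can be written out the same way as `pub_list_temp`.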

Cheers!!

scholar easyPubMed • 704 views
