I want to process the full text of many articles using europepmc::epmc_ftxt (so that I can later use tidypmc::pmc_text and tidypmc::separate_text). The R code in (a) below is much too slow at this first download step, before any processing. mypmc is a vector of PMCIDs, e.g. c("PMC11102434", "PMC11127444", ...). I wrapped each call in tryCatch inside a loop so that a single failed download does not abort the whole run, but perhaps this could be faster?
## (a) mypmc is a vector of PMCIDs, e.g. c("PMC11102434", ...)
library("europepmc")
## Pre-allocate the result list; a failed download is stored as NA
docs <- vector("list", length(mypmc))
for (i in seq_along(mypmc)) {
  docs[[i]] <- tryCatch(europepmc::epmc_ftxt(mypmc[[i]]), error = function(e) NA)
}
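A minimal sketch of the same loop wrapped in a function, so that results are named by PMCID and failures are easy to list afterwards. `fetch_all()` and its `fetch` argument are hypothetical names, not part of europepmc. This is not expected to be faster on its own, since each ID still triggers a separate request; it only makes the bookkeeping easier.

```r
# Hypothetical helper (not part of europepmc): apply a fetch function to each
# ID, returning NULL for IDs whose download fails.
fetch_all <- function(ids, fetch = europepmc::epmc_ftxt) {
  docs <- lapply(ids, function(id) {
    tryCatch(fetch(id), error = function(e) NULL)
  })
  names(docs) <- ids
  docs
}

# Offline demonstration: a stub stands in for the real download.
stub <- function(id) {
  if (id == "PMC0") stop("not found")
  paste("doc for", id)
}
res <- fetch_all(c("PMC1", "PMC0", "PMC2"), fetch = stub)

# IDs that failed, e.g. to retry later
failed <- names(res)[vapply(res, is.null, logical(1))]
```

With the real function, the call would simply be `docs <- fetch_all(mypmc)`.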
Ideally I would download only the articles in my list of PMCIDs. An alternative would be a batch download: https://europepmc.org/ftp/oa/ hosts many .gz files, which I could perhaps fetch one at a time in a loop and then search for my PMCIDs:
filez <- "PMC13900_PMC17829.xml.gz"
url <- paste0("https://europepmc.org/ftp/oa/", filez)
tf <- paste0(mypath, "result.xml.gz")
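## Binary mode keeps the .gz file intact on every platform
download.file(url, tf, mode = "wb")
## doc <- XML::xmlTreeParse(tf)
doc <- XML::xmlParse(tf)
## doc <- XML::xmlInternalTreeParse(tf)
## saveXML lives in the XML package, which is not attached here
XML::saveXML(doc, file = "output.xml")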
The code above runs, and I can open the .xml file it creates. The problem is that I cannot use output.xml with europepmc or tidypmc. europepmc::epmc_ftxt(paste0(mypath, "output.xml")) says "Please provide one PMCID, i.e. id starting with 'PMC'", and tidypmc::pmc_text() says "doc should be an xml_document from PubMed Central", which of course it is! Any ideas?
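Both messages may come down to input types. epmc_ftxt asks for a PMCID string, not a file path, and `xml_document` is the class used by the xml2 package, whereas XML::xmlParse returns an `XMLInternalDocument`. Below is a minimal sketch of splitting a multi-article file into one xml2 document per article. It runs on a tiny inline sample rather than a real download, and it assumes the bulk files have an `<articles>` root wrapping `<article>` nodes and mark the PMCID with `pub-id-type="pmcid"`; both should be checked against an actual file.

```r
library(xml2)

# Tiny stand-in for a bulk file.
# ASSUMPTION: real files share this layout (<articles> root, <article>
# children) and this pub-id-type value; verify before relying on either.
bulk_xml <- '<articles>
  <article>
    <front><article-meta>
      <article-id pub-id-type="pmcid">PMC13900</article-id>
    </article-meta></front>
    <body><sec><title>Intro</title><p>First text.</p></sec></body>
  </article>
  <article>
    <front><article-meta>
      <article-id pub-id-type="pmcid">PMC13901</article-id>
    </article-meta></front>
    <body><sec><title>Intro</title><p>Second text.</p></sec></body>
  </article>
</articles>'

# For a real file this would be read_xml("result.xml.gz"); xml2 reads .gz
# paths directly, but a large file may need a lot of memory.
bulk <- read_xml(bulk_xml)
nodes <- xml_find_all(bulk, "/articles/article")

# Copy each <article> node into a document of its own, so that every
# element of the list has class "xml_document".
article_docs <- lapply(nodes, function(node) xml_new_root(node))

# Name each document by its PMCID, then keep only the wanted ones.
get_pmcid <- function(doc) {
  xml_text(xml_find_first(doc, ".//article-id[@pub-id-type='pmcid']"))
}
names(article_docs) <- vapply(article_docs, get_pmcid, character(1))

wanted <- c("PMC13901")
hits <- article_docs[names(article_docs) %in% wanted]
```

Each element of `hits` could then be tried with `tidypmc::pmc_text()`; that last step is not tested here.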
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal