Accessing PubMed Central full-texts via FTP?
14 months ago

I want to process the full texts of many articles using europepmc::epmc_ftxt (so that I can later use tidypmc::pmc_text and tidypmc::separate_text). The R code in (a) below is much too slow at this first download step. mypmc is a vector of PMCIDs, e.g. c("PMC11102434", "PMC11127444", ...). I use a loop with error handling so that a single failed download does not stop the whole run, but perhaps this could be faster?

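## (a) mypmc is a vector of PMCIDs, e.g. c("PMC11102434", ...)
library("europepmc")
## Process at most the first 1000 IDs, without running past the end of mypmc
n <- min(1000, length(mypmc))
docs <- vector("list", n)
for (i in seq_len(n)) {
  ## Store NA instead of stopping when a full text cannot be fetched
  docs[[i]] <- tryCatch(europepmc::epmc_ftxt(mypmc[[i]]), error = function(e) NA)
}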
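The loop-with-tryCatch pattern above can also be written as a small reusable helper that records which IDs failed, so they can be retried later. This is only a sketch: fetch_all and fake_fetch are hypothetical names introduced here, and fake_fetch stands in for europepmc::epmc_ftxt so the example runs without network access. It does not make individual requests any faster; the per-article round trip is likely still the bottleneck.

```r
# Hypothetical helper: call fetch_fun on every ID and keep going on errors.
# Failed fetches are stored as NULL so they are easy to find afterwards.
fetch_all <- function(ids, fetch_fun) {
  out <- lapply(ids, function(id) {
    tryCatch(fetch_fun(id), error = function(e) NULL)
  })
  names(out) <- ids
  out
}

# Stand-in for europepmc::epmc_ftxt (assumption: used here for illustration only)
fake_fetch <- function(id) {
  if (id == "PMC2") stop("full text not available")
  paste("full text of", id)
}

res <- fetch_all(c("PMC1", "PMC2", "PMC3"), fake_fetch)

# IDs whose download failed and could be retried
failed <- names(res)[vapply(res, is.null, logical(1))]
```

With the real package, the call would presumably be fetch_all(mypmc, europepmc::epmc_ftxt).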

Ideally I would download only the articles in my list of PMCIDs. An alternative would be a batch download: https://europepmc.org/ftp/oa/ hosts many .gz files, which I could possibly fetch one at a time in a loop and then search for my PMCIDs:

filez <- "PMC13900_PMC17829.xml.gz"
url <- paste0("https://europepmc.org/ftp/oa/", filez)
## mypath is my local directory (ending in a path separator)
tf <- paste0(mypath, "result.xml.gz")
## mode = "wb" keeps the gzip archive intact on every platform
download.file(url, tf, mode = "wb")
## doc <- XML::xmlTreeParse(tf)
doc <- XML::xmlParse(tf)
## doc <- XML::xmlInternalTreeParse(tf)
XML::saveXML(doc, file = "output.xml")

The code above definitely works, because I can open the .xml file it creates. The problem is that I cannot use output.xml with the R packages europepmc or tidypmc: europepmc::epmc_ftxt(paste0(mypath, "output.xml")) says "Please provide one PMCID, i.e. id starting with 'PMC'", and tidypmc::pmc_text() says "doc should be an xml_document from PubMed Central", which of course it is! Any ideas?
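A possible route, sketched under assumptions I have not verified against the real files: the first message suggests epmc_ftxt() wants a PMCID rather than a file path, and the second suggests tidypmc::pmc_text() wants an xml2 xml_document for a single article, whereas XML::xmlParse() returns an object from the XML package and a bulk file holds many articles under one root. The sketch below uses a toy document in place of the bulk file. The <articles> root, the pub-id-type='pmcid' attribute, and the helper name split_articles are all assumptions for illustration.

```r
library(xml2)

# Toy stand-in for a bulk file: one root element wrapping several <article> nodes.
# (Assumption: the real files follow a similar layout.)
bulk <- read_xml('
<articles>
  <article>
    <front><article-meta>
      <article-id pub-id-type="pmcid">PMC13900</article-id>
    </article-meta></front>
    <body><sec><title>Intro</title><p>First article text.</p></sec></body>
  </article>
  <article>
    <front><article-meta>
      <article-id pub-id-type="pmcid">PMC13901</article-id>
    </article-meta></front>
    <body><sec><title>Intro</title><p>Second article text.</p></sec></body>
  </article>
</articles>')

# Hypothetical helper: return one xml_document per wanted article, named by PMCID
split_articles <- function(bulk, wanted) {
  articles <- xml_find_all(bulk, "//article")
  # One ID per article; NA where no matching <article-id> exists
  ids <- xml_text(xml_find_first(articles, ".//article-id[@pub-id-type='pmcid']"))
  keep <- ids %in% wanted
  # Copy each selected node into a new document of its own
  docs <- lapply(articles[keep], xml_new_root)
  names(docs) <- ids[keep]
  docs
}

docs <- split_articles(bulk, wanted = c("PMC13901", "PMC99999"))
```

Each element of docs is then an xml2 document with <article> as its root, which is the kind of object one could try passing to tidypmc::pmc_text(). For a real download, xml2::read_xml() can read a local .gz file directly, so the XML::saveXML() step may not be needed.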

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

PMCID epmc_ftxt pmc europepmc • 454 views