Question: In R, using XPath and the XML package, how can I select nodes (getNodeSet) based on their value?
User56, 7.4 years ago, wrote:

This is a follow-up to this question: http://biostar.stackexchange.com/questions/17333/is-there-an-r-library-similar-to-libraries-like-bioperl-biopython-or-bioruby-m

This is a problem in R using the XML package. I have 2 PubMed articles and I need to select only certain IDs, and only from certain databases. I cannot work out how to restrict a search by element value using XPath in R.

Here is my code:

library(XML)

# this PMID has GEO IDs
url1 <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=21558518&retmode=xml"
# this PMID has clinical trials
url2 <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=21830967&retmode=xml"
xml1 <- xmlTreeParse(url1, useInternalNodes = TRUE)
xml2 <- xmlTreeParse(url2, useInternalNodes = TRUE)
# list the database names present in each record
ns1 <- getNodeSet(xml1, '//DataBank/DataBankName')
ns2 <- getNodeSet(xml2, '//DataBank/DataBankName')
ns1
ns2

I need to modify the XPath to select only nodes where DataBankName is 'ClinicalTrials.gov' or 'ISRCTN'. A record that contains ISRCTN IDs is this one:

 url3="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=21675889&retmode=xml"

I need the IDs stored in the AccessionNumberList element:

(ns <- getNodeSet(xml1, '//DataBank'))

It looks like this:

<DataBank>
  <DataBankName>GEO</DataBankName>
  <AccessionNumberList>
    <AccessionNumber>GSE25055</AccessionNumber>
    <AccessionNumber>GSE25065</AccessionNumber>
    <AccessionNumber>GSE25066</AccessionNumber>
  </AccessionNumberList>
</DataBank>

I tried several ways to write an XPath that matches on an element's value but could not solve it. (Any other solution that bypasses XPath is fine too.)

Here is what I need (but it gives me an error):

ns <- getNodeSet(xml1, '//DataBank/DataBankName[text()="ClinicalTrials.gov" or text()="ISRCTN"]/../AccessionNumberList/AccessionNumber')
Chris Maloney (Bethesda, MD), 7.4 years ago, wrote:

I don't have R, so I can't try this, but this might also work (simplified slightly from your example):

ns <- getNodeSet(xml1, 
  '//DataBank[DataBankName="ClinicalTrials.gov" or 
              DataBankName="ISRCTN"]
   /AccessionNumberList/AccessionNumber')

Yes, tried it: returns an XMLNodeSet with the 2 accession numbers (from xml3).

(reply by Neilfws, 7.4 years ago)

Thanks. Yes, that is smart: it does not require backtracking to the parent. I did not see that in any of the XPath examples on the net.

(reply by User56, 7.4 years ago)
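
The predicate-on-the-parent idea from this answer can be tried without network access. The following is only a sketch: the inline XML is a hypothetical, cut-down stand-in for the DataBankList part of a PubMed record, and `xml_text`, `doc`, `trial_xpath` and `trial_ids` are illustrative names. It uses `xpathSApply()` with `xmlValue` so that the result is a character vector of IDs rather than a node set.

```r
library(XML)

# Hypothetical, simplified stand-in for a PubMed DataBankList (not a real record).
# Built with paste0() so there is no whitespace between elements.
xml_text <- paste0(
  "<DataBankList>",
  "<DataBank><DataBankName>GEO</DataBankName>",
  "<AccessionNumberList><AccessionNumber>GSE25055</AccessionNumber></AccessionNumberList>",
  "</DataBank>",
  "<DataBank><DataBankName>ISRCTN</DataBankName>",
  "<AccessionNumberList>",
  "<AccessionNumber>ISRCTN78147026</AccessionNumber>",
  "<AccessionNumber>ISRCTN87739946</AccessionNumber>",
  "</AccessionNumberList>",
  "</DataBank>",
  "</DataBankList>"
)
doc <- xmlParse(xml_text, asText = TRUE)

# Filter the DataBank parent on the value of its DataBankName child,
# then walk down to the accession numbers.
trial_xpath <- paste0(
  '//DataBank[DataBankName="ClinicalTrials.gov" or DataBankName="ISRCTN"]',
  '/AccessionNumberList/AccessionNumber'
)

# xmlValue extracts the text content of each matching node
trial_ids <- xpathSApply(doc, trial_xpath, xmlValue)
print(trial_ids)
```

Because the predicate sits on DataBank itself, the GEO block is dropped before the path descends, so no `..` step is needed.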
Neilfws (Sydney, Australia), 7.4 years ago, wrote:

Would you consider a non-XPath solution?

The XML package has a couple of useful functions: xmlToList() and xmlToDataFrame(). These convert the XML to native R data structures, which can be easier to work with in R.

Something like this code, which also uses llply from the plyr package to put the accession numbers into a new list:

library(XML)
library(plyr)
url3 <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=21675889&retmode=xml"
xml3 <- xmlTreeParse(url3, useInternalNodes = TRUE)
# convert to list
l <- xmlToList(xml3)

# should really check for existence of DataBankName
# but we'll leave that for now

if(l$PubmedArticle$MedlineCitation$Article$DataBankList$DataBank$DataBankName == "ISRCTN") {
  accn <- llply(l$PubmedArticle$MedlineCitation$Article$DataBankList$DataBank$AccessionNumberList)
}

print(accn)
# $AccessionNumber
# [1] "ISRCTN78147026"

# $AccessionNumber
# [1] "ISRCTN87739946"

It looks unwieldy, but the "$" notation for accessing list elements is helpful, once you see how the XML maps to the list.
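
As a rough sketch of the existence check mentioned in the code comments, the list approach can be extended to records with several DataBank entries. The inline XML below is a hypothetical, simplified stand-in whose root is DataBankList; in a real PubMed record that element sits further down, under PubmedArticle$MedlineCitation$Article. The names `banks`, `wanted`, `keep` and `accn` are illustrative.

```r
library(XML)

# Hypothetical, simplified stand-in for a PubMed DataBankList (not a real record)
xml_text <- paste0(
  "<DataBankList>",
  "<DataBank><DataBankName>GEO</DataBankName>",
  "<AccessionNumberList><AccessionNumber>GSE25055</AccessionNumber></AccessionNumberList>",
  "</DataBank>",
  "<DataBank><DataBankName>ISRCTN</DataBankName>",
  "<AccessionNumberList>",
  "<AccessionNumber>ISRCTN78147026</AccessionNumber>",
  "<AccessionNumber>ISRCTN87739946</AccessionNumber>",
  "</AccessionNumberList>",
  "</DataBank>",
  "</DataBankList>"
)
l <- xmlToList(xmlParse(xml_text, asText = TRUE))

# Keep every DataBank entry; there may be more than one with the same name
banks <- l[names(l) == "DataBank"]
wanted <- c("ClinicalTrials.gov", "ISRCTN")

# TRUE only where DataBankName exists and is one of the wanted databases
keep <- vapply(
  banks,
  function(b) !is.null(b[["DataBankName"]]) && b[["DataBankName"]] %in% wanted,
  logical(1)
)

# Flatten the accession numbers of the selected entries into a character vector
accn <- unlist(lapply(banks[keep], function(b) b[["AccessionNumberList"]]),
               use.names = FALSE)
print(accn)
```

If no entry matches, `accn` is NULL, so a caller would want to test `length(accn)` before using it.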


Thanks for pointing out those functions. Any solution is fine.

(reply by User56, 7.4 years ago)
User56, 7.4 years ago, wrote:

There were typos in the code in my question: an extra " and the plural for trial(s).

The last piece of code actually works (repeated here):

(ns <- getNodeSet(xml3, '//DataBank/DataBankName[text() = "ClinicalTrials.gov" or text() = "ISRCTN"]/../AccessionNumberList/AccessionNumber'))
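
One thing worth noting about this query is what it returns when a record has no matching database, as with a GEO-only record. The sketch below assumes a hypothetical, cut-down DataBankList with only a GEO entry (not a real PubMed record); `doc_geo` and `ns_geo` are illustrative names. In that case getNodeSet() should return an empty node set rather than raise an error, so checking its length is a reasonable guard.

```r
library(XML)

# Hypothetical GEO-only stand-in for a PubMed DataBankList (not a real record)
doc_geo <- xmlParse(paste0(
  "<DataBankList><DataBank><DataBankName>GEO</DataBankName>",
  "<AccessionNumberList><AccessionNumber>GSE25055</AccessionNumber></AccessionNumberList>",
  "</DataBank></DataBankList>"
), asText = TRUE)

# Same XPath as above: match DataBankName by its text, step back up with "..",
# then descend to the accession numbers
ns_geo <- getNodeSet(
  doc_geo,
  '//DataBank/DataBankName[text() = "ClinicalTrials.gov" or text() = "ISRCTN"]/../AccessionNumberList/AccessionNumber'
)

# Guard against the no-match case before extracting values
if (length(ns_geo) == 0) {
  message("no ClinicalTrials.gov or ISRCTN accession numbers in this record")
} else {
  print(sapply(ns_geo, xmlValue))
}
```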