Question

How, within R, using XPath and the XML package, can I select nodes (getNodeSet) based on their value?

12.2 years ago

User56 ▴ 100

This is a follow-up to this question: http://biostar.stackexchange.com/questions/17333/is-there-an-r-library-similar-to-libraries-like-bioperl-biopython-or-bioruby-m

This is a problem in R using the XML package. I have two PubMed articles and I need to select only certain IDs, and only from certain databases. I cannot work out how to specify a search by element value using XPath in R.

Here is my code:

library(XML)
# this PMID has GEO IDs
url1 <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=21558518&retmode=xml"
# this PMID has Clinical Trials IDs
url2 <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=21830967&retmode=xml"
xml1 <- xmlTreeParse(url1, useInternalNodes = TRUE)
xml2 <- xmlTreeParse(url2, useInternalNodes = TRUE)
# list the databank names found in each article
ns1 <- getNodeSet(xml1, '//DataBank/DataBankName')
ns2 <- getNodeSet(xml2, '//DataBank/DataBankName')
ns1
ns2

I need to modify the XPath to select only the nodes where DataBankName is 'ClinicalTrials.gov' or 'ISRCTN'. A URL that shows ISRCTN is this one:

 url3="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=21675889&retmode=xml"

I need the IDs stored in the element's accession number list:

(ns <- getNodeSet(xml1, '//DataBank'))

It looks like this:

<DataBank>
  <DataBankName>GEO</DataBankName>
  <AccessionNumberList>
    <AccessionNumber>GSE25055</AccessionNumber>
    <AccessionNumber>GSE25065</AccessionNumber>
    <AccessionNumber>GSE25066</AccessionNumber>
  </AccessionNumberList>
</DataBank>

I tried several ways to match in XPath based on an element value but could not solve it. (Any other solution that bypasses XPath is fine too.)

Here is what I need (but it gives me an error):

ns <- getNodeSet(xml1, '//DataBank/DataBankName[text()="ClinicalTrials.gov" or text()="ISRCTN"]/../AccessionNumberList/AccessionNumber')

r xml pubmed • 24k views

ADD COMMENT • link updated 12.2 years ago by Chris Maloney ▴ 360 • written 12.2 years ago by User56 ▴ 100

score 2 · Answer 1 · 2012-02-15

12.2 years ago

Chris Maloney ▴ 360

I don't have R, so I can't try this, but this might also work (simplified slightly from your example):

ns <- getNodeSet(xml1, 
  '//DataBank[DataBankName="ClinicalTrials.gov" or 
              DataBankName="ISRCTN"]
   /AccessionNumberList/AccessionNumber')
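
To try the predicate without hitting the network, here is a minimal sketch that runs the same XPath against a small inline document. The XML text is a hypothetical stand-in for a PubMed record (trimmed to the DataBankList, with accession numbers taken from this thread), and the variable names are made up for illustration.

```r
library(XML)

# Hypothetical stand-in for an efetch record, trimmed to the DataBankList.
xml_text <- '<PubmedArticleSet>
  <DataBankList>
    <DataBank>
      <DataBankName>GEO</DataBankName>
      <AccessionNumberList>
        <AccessionNumber>GSE25055</AccessionNumber>
      </AccessionNumberList>
    </DataBank>
    <DataBank>
      <DataBankName>ISRCTN</DataBankName>
      <AccessionNumberList>
        <AccessionNumber>ISRCTN78147026</AccessionNumber>
        <AccessionNumber>ISRCTN87739946</AccessionNumber>
      </AccessionNumberList>
    </DataBank>
  </DataBankList>
</PubmedArticleSet>'

doc <- xmlParse(xml_text, asText = TRUE)

# The predicate sits on DataBank itself, so no step back up to the parent is needed.
xpath <- paste0(
  '//DataBank[DataBankName="ClinicalTrials.gov" or DataBankName="ISRCTN"]',
  '/AccessionNumberList/AccessionNumber'
)

ns <- getNodeSet(doc, xpath)

# Pull the text out of each matched node to get a plain character vector.
trial_ids <- sapply(ns, xmlValue)
print(trial_ids)
```

On this sample the GEO entry should be skipped and only the two ISRCTN numbers kept.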

ADD COMMENT • link 12.2 years ago by Chris Maloney ▴ 360

Yes, tried it: returns an XMLNodeSet with the 2 accession numbers (from xml3).

ADD REPLY • link 12.2 years ago by Neilfws 49k

Thanks. Yes, that is smart: it does not require backtracking to the parent. I did not see that in any XPath examples on the net.

ADD REPLY • link 12.2 years ago by User56 ▴ 100

score 1 · Answer 2 · 2012-02-14

Would you consider a non-XPath solution?

The XML package has a couple of useful functions; xmlToList() and xmlToDataFrame(). These can convert the XML to native R data structures, which can be easier to work with within R.

Something like this code, which also uses llply from the plyr package to put the accession numbers into a new list:

library(XML)
library(plyr)
url3 <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=21675889&retmode=xml"
xml3 <- xmlTreeParse(url3, useInternalNodes = TRUE)
# convert to list
l <- xmlToList(xml3)

# should really check for existence of DataBankName
# but we'll leave that for now

if(l$PubmedArticle$MedlineCitation$Article$DataBankList$DataBank$DataBankName == "ISRCTN") {
  accn <- llply(l$PubmedArticle$MedlineCitation$Article$DataBankList$DataBank$AccessionNumberList)
}

print(accn)
# $AccessionNumber
# [1] "ISRCTN78147026"

# $AccessionNumber
# [1] "ISRCTN87739946"

It looks unwieldy, but the "$" notation for accessing list elements is helpful, once you see how the XML maps to the list.
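
The code above only looks at the first DataBank and assumes DataBankName exists. A rough sketch of one way to cover articles with several DataBank entries, or none, is below. It assumes xmlToList() gives each DataBank child its own list element named "DataBank"; the inline XML is a hypothetical, trimmed stand-in for the real record, and `wanted`, `banks` and `accn` are just illustrative names.

```r
library(XML)

# Hypothetical, trimmed stand-in for the efetch record of one article.
xml_text <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <DataBankList>
          <DataBank>
            <DataBankName>GEO</DataBankName>
            <AccessionNumberList>
              <AccessionNumber>GSE25055</AccessionNumber>
            </AccessionNumberList>
          </DataBank>
          <DataBank>
            <DataBankName>ISRCTN</DataBankName>
            <AccessionNumberList>
              <AccessionNumber>ISRCTN78147026</AccessionNumber>
              <AccessionNumber>ISRCTN87739946</AccessionNumber>
            </AccessionNumberList>
          </DataBank>
        </DataBankList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(xml_text, asText = TRUE)
l <- xmlToList(doc)

wanted <- c("ClinicalTrials.gov", "ISRCTN")

# NULL when the article has no DataBankList; the steps below then yield NULL too.
dbl <- l$PubmedArticle$MedlineCitation$Article$DataBankList
banks <- dbl[names(dbl) == "DataBank"]

# Keep accession numbers only from the databanks we care about.
accn <- unlist(lapply(banks, function(db) {
  if (!is.null(db$DataBankName) && db$DataBankName %in% wanted) {
    nums <- db$AccessionNumberList
    unlist(nums[names(nums) == "AccessionNumber"])
  }
}), use.names = FALSE)

print(accn)
```

Filtering by element name rather than position is a deliberate choice here, so that any stray list elements do not end up among the accession numbers.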

score 1 · Answer 3 · 2012-02-14

12.2 years ago

User56 ▴ 100

There was a typo in the code in my question: an extra " and also the plural for trial(s).

The last piece of code actually works (repeated here):

(ns <- getNodeSet(xml3, '//DataBank/DataBankName[text() = "ClinicalTrials.gov" or text() = "ISRCTN"]/../AccessionNumberList/AccessionNumber'))
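
For anyone comparing the two forms, here is a small sketch (on hypothetical inline XML, not the live PubMed records) that checks that the parent-axis XPath and the predicate-on-DataBank XPath pick out the same values. It also shows that a document with only GEO entries gives a zero-length node set rather than an error.

```r
library(XML)

# Hypothetical sample with one GEO and one ISRCTN databank.
doc <- xmlParse('<DataBankList>
  <DataBank>
    <DataBankName>GEO</DataBankName>
    <AccessionNumberList>
      <AccessionNumber>GSE25055</AccessionNumber>
    </AccessionNumberList>
  </DataBank>
  <DataBank>
    <DataBankName>ISRCTN</DataBankName>
    <AccessionNumberList>
      <AccessionNumber>ISRCTN78147026</AccessionNumber>
      <AccessionNumber>ISRCTN87739946</AccessionNumber>
    </AccessionNumberList>
  </DataBank>
</DataBankList>', asText = TRUE)

# Form 1: match the name node, then step back up with "..".
via_parent <- paste0(
  '//DataBank/DataBankName[text() = "ClinicalTrials.gov" or text() = "ISRCTN"]',
  '/../AccessionNumberList/AccessionNumber'
)

# Form 2: put the test in a predicate on DataBank.
via_predicate <- paste0(
  '//DataBank[DataBankName = "ClinicalTrials.gov" or DataBankName = "ISRCTN"]',
  '/AccessionNumberList/AccessionNumber'
)

ids_parent <- xpathSApply(doc, via_parent, xmlValue)
ids_predicate <- xpathSApply(doc, via_predicate, xmlValue)
same <- identical(ids_parent, ids_predicate)
print(same)

# A document with no matching databank returns an empty node set.
geo_only <- xmlParse(
  '<DataBank><DataBankName>GEO</DataBankName><AccessionNumberList><AccessionNumber>GSE25055</AccessionNumber></AccessionNumberList></DataBank>',
  asText = TRUE
)
n_geo <- length(getNodeSet(geo_only, via_parent))
print(n_geo)
```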

ADD COMMENT • link 12.2 years ago by User56 ▴ 100