Question

Extracting Xml Output From Genbank() Query

0

Entering edit mode

12.6 years ago

Andrew ▴ 10

I'm interested in extracting the sig_peptide field from:

http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3

I'm using the genbank() function in the annotate package as follows:

library(annotate)
answer <- genbank('NM_000610', disp='data', type='uid')

However, I'm having some difficulty in using the XML package to parse out the sig_peptide field. Any pointers or suggestions are appreciated, as I'm new to the XML package.

bioconductor xml r • 3.8k views

ADD COMMENT • link updated 12.6 years ago by Neilfws 49k • written 12.6 years ago by Andrew ▴ 10

score 3 · Answer 1 · 2011-09-22

Part of your problem: the XML returned by the genbank() method is different to that returned by Pierre's EUtils query and does not, in fact, contain the tag sig_peptide.

If we do:

top <- xmlRoot(answer)
getNodeSet(top, "//GBSeq_feature-table")

We see that no nodes are returned:

list()
attr(,"class")
[1] "XMLNodeSet"

Whereas if we try:

feats <- getNodeSet(top, "//Seq-feat")

61 nodes are returned:

length(feats)
[1] 61

In fact, the XML returned by genbank() is the same as that obtained by visiting the NCBI web page for the sequence and sending the file to XML. Examination of that file using e.g. grep:

grep sig_peptide Downloads/sequence.xml
# no matches

shows that this XML does not contain sig_peptide.

So first part of the answer: do not use the genbank() method from library(annotate). To get the correct XML, containing sig_peptide, use:

library(RCurl)
xml <- getURL("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_000610&rettype=gb&retmode=xml")
doc <- xmlTreeParse(xml, useInternalNodes = T)

Now, we can get the features from that XML file into a list:

top <- xmlRoot(doc)
feats <- getNodeSet(top, "//GBSeq_feature-table//GBFeature")
length(feats)
# [1] 54

And then use lapply and xmlSApply to extract values for features:

values <- lapply(feats, function(x) xmlSApply(x, xmlValue))

Examination of this new list shows the sig_peptide is element 6:

names(values[[6]])
# [1] "GBFeature_key"       "GBFeature_location"  "GBFeature_intervals"
# [4] "GBFeature_quals"

values[[6]]["GBFeature_key"]
# GBFeature_key 
# "sig_peptide"

values[[6]]["GBFeature_location"]
# GBFeature_location 
#       "435..494"

You can also get the feature nodes into a data frame:

df1 <- xmlToDataFrame(feats)

And grep for "sig_peptide":

df1[grep("sig_", df1$GBFeature_key), 1:2]
# GBFeature_key   GBFeature_location
# 6   sig_peptide           435..494

Final hint: the R XML documentation is dreadful, so don't despair when trying to learn it. It's very difficult for the experienced, never mind beginners.

score 2 · Answer 2 · 2011-09-21

I can't resist to give a XSLT solution with the following stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="&lt;a href=" http:="" www.w3.org="" 1999="" XSL="" Transform"="" rel="nofollow">http://www.w3.org/1999/XSL/Transform"
    >
<xsl:output method="text"/>

<xsl:template match="/">
 <xsl:apply-templates select="GBSet/GBSeq"/>
</xsl:template>

<xsl:template match="GBSeq">
<xsl:variable name="acn"><xsl:value-of select="GBSeq_locus"/></xsl:variable>
<xsl:variable name="seq"><xsl:value-of select="GBSeq_sequence"/></xsl:variable>
<xsl:for-each select="GBSeq_feature-table/GBFeature[GBFeature_key='sig_peptide']/GBFeature_intervals/GBInterval[GBInterval_from ][GBInterval_to]">
<xsl:variable name="x0" select="number(GBInterval_from)"/>
<xsl:variable name="x1" select="number(GBInterval_to)"/>
<xsl:text>></xsl:text>
<xsl:value-of select="concat($acn,'|',$x0,'-',$x1)"/>
<xsl:text>
</xsl:text>
<xsl:value-of select="substring($seq,$x0,$x1 - $x0 + 1)"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>

usage:

xsltproc --novalid stylesheet.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_000610&rettype=gb&retmode=xml"

>NM_000610|435-494
atggacaagttttggtggcacgcagcctggggactctgcctcgtgccgctgagcctggcg