Extracting Xml Output From Genbank() Query
2
0
Entering edit mode
14.1 years ago
Andrew ▴ 10

I'm interested in extracting the sig_peptide field from:

http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3

I'm using the genbank() function in the annotate package as follows:

library(annotate)
answer <- genbank('NM_000610', disp='data', type='uid')

However, I'm having some difficulty in using the XML package to parse out the sig_peptide field. Any pointers or suggestions are appreciated, as I'm new to the XML package.

bioconductor xml r • 4.4k views
ADD COMMENT
3
Entering edit mode
14.1 years ago
Neilfws 49k

Part of your problem: the XML returned by the genbank() method is different to that returned by Pierre's EUtils query and does not, in fact, contain the tag sig_peptide.

If we do:

top <- xmlRoot(answer)
getNodeSet(top, "//GBSeq_feature-table")

We see that no nodes are returned:

list()
attr(,"class")
[1] "XMLNodeSet"

Whereas if we try:

feats <- getNodeSet(top, "//Seq-feat")

61 nodes are returned:

length(feats)
[1] 61

In fact, the XML returned by genbank() is the same as that obtained by visiting the NCBI web page for the sequence and sending the file to XML. Examination of that file using e.g. grep:

grep sig_peptide Downloads/sequence.xml
# no matches

shows that this XML does not contain sig_peptide.

So first part of the answer: do not use the genbank() method from library(annotate). To get the correct XML, containing sig_peptide, use:

library(RCurl)
xml <- getURL("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_000610&rettype=gb&retmode=xml")
doc <- xmlTreeParse(xml, useInternalNodes = T)

Now, we can get the features from that XML file into a list:

top <- xmlRoot(doc)
feats <- getNodeSet(top, "//GBSeq_feature-table//GBFeature")
length(feats)
# [1] 54

And then use lapply and xmlSApply to extract values for features:

values <- lapply(feats, function(x) xmlSApply(x, xmlValue))

Examination of this new list shows the sig_peptide is element 6:

names(values[[6]])
# [1] "GBFeature_key"       "GBFeature_location"  "GBFeature_intervals"
# [4] "GBFeature_quals"

values[[6]]["GBFeature_key"]
# GBFeature_key 
# "sig_peptide"

values[[6]]["GBFeature_location"]
# GBFeature_location 
#       "435..494"

You can also get the feature nodes into a data frame:

df1 <- xmlToDataFrame(feats)

And grep for "sig_peptide":

df1[grep("sig_", df1$GBFeature_key), 1:2]
# GBFeature_key   GBFeature_location
# 6   sig_peptide           435..494

Final hint: the R XML documentation is dreadful, so don't despair when trying to learn it. It's very difficult for the experienced, never mind beginners.

ADD COMMENT
0
Entering edit mode

Thanks for the details and the leg work. All very helpful!

ADD REPLY
0
Entering edit mode

Cool; FWIW, xmlInternalTreeParse("http://...") works without RCurl. The xpath query node <- getNodeSet(xml, '//GBFeature[GBFeature_key/text()="sig_peptide"]')[[1]] selects the relevant node, and xpathApply(node, ".//GBQualifier_name") is the syntax for querying the content of node.

ADD REPLY
0
Entering edit mode

Thanks for the tips Martin.

ADD REPLY
2
Entering edit mode
14.1 years ago

I can't resist to give a XSLT solution with the following stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="&lt;a href=" http:="" www.w3.org="" 1999="" XSL="" Transform"="" rel="nofollow">http://www.w3.org/1999/XSL/Transform"
    >
<xsl:output method="text"/>

<xsl:template match="/">
 <xsl:apply-templates select="GBSet/GBSeq"/>
</xsl:template>

<xsl:template match="GBSeq">
<xsl:variable name="acn"><xsl:value-of select="GBSeq_locus"/></xsl:variable>
<xsl:variable name="seq"><xsl:value-of select="GBSeq_sequence"/></xsl:variable>
<xsl:for-each select="GBSeq_feature-table/GBFeature[GBFeature_key='sig_peptide']/GBFeature_intervals/GBInterval[GBInterval_from ][GBInterval_to]">
<xsl:variable name="x0" select="number(GBInterval_from)"/>
<xsl:variable name="x1" select="number(GBInterval_to)"/>
<xsl:text>></xsl:text>
<xsl:value-of select="concat($acn,'|',$x0,'-',$x1)"/>
<xsl:text>
</xsl:text>
<xsl:value-of select="substring($seq,$x0,$x1 - $x0 + 1)"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>

usage:

xsltproc --novalid stylesheet.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_000610&rettype=gb&retmode=xml"

>NM_000610|435-494
atggacaagttttggtggcacgcagcctggggactctgcctcgtgccgctgagcctggcg
ADD COMMENT
0
Entering edit mode

Thanks. If there is an answer in R, that would be even better!

ADD REPLY
0
Entering edit mode

Aside from not being an R solution, there is a problem with this: the XML returned by the EUtils query is not the same as that returned by the genbank() method in library(annotate). In fact, the latter does not even contain the tag sig_peptide.

ADD REPLY

Login before adding your answer.

Traffic: 3295 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6