Question: Entrez xtract "unrecognized argument '-match'" error
Nancy Ouyang170 wrote:

I can't figure out how to use the "-match" syntax even after reading all the documentation I could find. I get these errors:

$ cat xml.txt | xtract -pattern Gene-commentary -match Gene-commentary_type:1

Unrecognized argument '-match'
No -element before 'Gene-commentary_type:1'

$ cat xml.txt | xtract -element Gene-commentary -match Gene-commentary_type:1

Unrecognized argument '-match'
No -element before 'Gene-commentary_type:1'

What am I doing wrong?

========

What I am trying to do is pull the accession of the reference sequence and the coordinates of the gene region for a given entry in NCBI Gene (see https://www.biostars.org/p/122522/), so that I can run $ efetch -format FASTA -seqstart -seqend with those values and get the matching sequence.

I could parse the XML in Python to do it, but it really seems like I should be able to do this in "one line" using Entrez Direct, if only I could get "-match" to work :/

Here is what the XML looks like. Say I have a gene record saved as XML:

epost -db gene -id 672 | efetch -format xml > xml.txt

According to the outline,

cat xml.txt | xtract -outline

 <Gene-commentary>
      <Gene-commentary_type value="genomic">1</Gene-commentary_type>
      <Gene-commentary_heading>Reference assembly</Gene-commentary_heading>
      <Gene-commentary_label>RefSeqGene</Gene-commentary_label>
      <Gene-commentary_accession>NG_005905</Gene-commentary_accession>
      <Gene-commentary_version>2</Gene-commentary_version>
      <Gene-commentary_seqs>
        <Seq-loc>
          <Seq-loc_int>
            <Seq-interval>
              <Seq-interval_from>92500</Seq-interval_from>
              <Seq-interval_to>173688</Seq-interval_to>

 

I have read:

Attempting To Utilise The New Entrez Direct Package But Having Difficulty With Pubmed And Nucleotide Xml Parsing

http://www.ncbi.nlm.nih.gov/books/NBK179288/ (I followed these instructions to install it, so $ which epost returns ~/edirect/epost)

http://www.ncbi.nlm.nih.gov/news/02-06-2014-entrez-direct-released/?campaign=facebook-02072014

http://elane.stanford.edu/laneconnex/public/media/documents/EntrezDirect.pdf
 

hpmcwill1.1k wrote:

From a bit of experimentation with 'xtract', it appears that the order of the command-line arguments matters: '-match' must be followed by an '-element' option. The missing '-element' appears to be the source of the error messages you receive.

Using just 'xtract', the closest I've gotten so far is:

cat xml.txt | edirect/xtract -pattern Gene-commentary -match 'Gene-commentary_type:1' -element 'Gene-commentary_accession' Seq-interval

You may be able to further anchor the patterns to make the extraction more specific.
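
For comparison, here is a minimal Python sketch of the same filter with the anchoring made explicit. It is not an xtract recipe. It assumes the record has the Entrezgene-Set/Entrezgene/Entrezgene_locus layout and uses a trimmed sample modeled on the snippet in the question. The closing tags, the wrapper elements, and the nested "NM_EXAMPLE" entry are filled in by hand for illustration, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modeled on the snippet in the question. Closing tags, the
# wrapper elements and the nested "NM_EXAMPLE" entry are assumptions added
# only so the example is well-formed and shows why anchoring matters.
SAMPLE_XML = """\
<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_locus>
      <Gene-commentary>
        <Gene-commentary_type value="genomic">1</Gene-commentary_type>
        <Gene-commentary_heading>Reference assembly</Gene-commentary_heading>
        <Gene-commentary_label>RefSeqGene</Gene-commentary_label>
        <Gene-commentary_accession>NG_005905</Gene-commentary_accession>
        <Gene-commentary_version>2</Gene-commentary_version>
        <Gene-commentary_seqs>
          <Seq-loc>
            <Seq-loc_int>
              <Seq-interval>
                <Seq-interval_from>92500</Seq-interval_from>
                <Seq-interval_to>173688</Seq-interval_to>
              </Seq-interval>
            </Seq-loc_int>
          </Seq-loc>
        </Gene-commentary_seqs>
        <Gene-commentary_products>
          <Gene-commentary>
            <Gene-commentary_type value="mRNA">3</Gene-commentary_type>
            <Gene-commentary_accession>NM_EXAMPLE</Gene-commentary_accession>
          </Gene-commentary>
        </Gene-commentary_products>
      </Gene-commentary>
    </Entrezgene_locus>
  </Entrezgene>
</Entrezgene-Set>
"""


def genomic_intervals(xml_text):
    """Return (accession, from, to) for each genomic, type-1 Gene-commentary."""
    root = ET.fromstring(xml_text)
    results = []
    # Only direct children of Entrezgene_locus are visited, so nested
    # commentaries (products, etc.) are never matched.
    for gc in root.findall("./Entrezgene/Entrezgene_locus/Gene-commentary"):
        gc_type = gc.find("Gene-commentary_type")
        if gc_type is None or gc_type.get("value") != "genomic" or gc_type.text != "1":
            continue
        accession = gc.findtext("Gene-commentary_accession")
        for iv in gc.findall("Gene-commentary_seqs/Seq-loc/Seq-loc_int/Seq-interval"):
            start = int(iv.findtext("Seq-interval_from"))
            stop = int(iv.findtext("Seq-interval_to"))
            results.append((accession, start, stop))
    return results


if __name__ == "__main__":
    for accession, start, stop in genomic_intervals(SAMPLE_XML):
        print(accession, start, stop)
```

To try it on the real record, pass the contents of xml.txt instead of SAMPLE_XML.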

 


Thanks for answering my specific question! It's weird their error message says "before" :/ I wasn't sure who to give the checkmark to, but I think Pierre Lindenbaum answered my actual question.

Reply written 4.4 years ago by Nancy Ouyang
Pierre Lindenbaum119k wrote:

Using a good old XSLT stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="/Entrezgene-Set/Entrezgene/Entrezgene_locus">
      <xsl:for-each select="Gene-commentary[Gene-commentary_type/@value='genomic' and Gene-commentary_type/text()='1']">
        <xsl:variable name="acn">
          <xsl:value-of select="concat('(',Gene-commentary_heading,')',Gene-commentary_accession)"/>
        </xsl:variable>
        <xsl:for-each select="Gene-commentary_seqs/Seq-loc/Seq-loc_int/Seq-interval">
          <xsl:value-of select="concat($acn,':',Seq-interval_from,'-',Seq-interval_to)"/>
          <xsl:text>
</xsl:text>
        </xsl:for-each>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

 

Run (with the stylesheet saved as transform.xsl):

 xsltproc --novalid transform.xsl  "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=672&retmode=xml"

Output:

(Reference GRCh38 Primary Assembly)NC_000017:43044294-43125482
(Reference assembly)NG_005905:92500-173688
(Alternate CHM1_1.1)NC_018928:41431850-41513017
(Alternate HuRef)AC_000149:36962662-37043808
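
Since the goal in the question is to feed these values to efetch, here is a rough Python sketch that turns one output line into an E-utilities efetch URL for the FASTA subsequence. It makes some assumptions: the line has exactly the "(heading)accession:from-to" shape shown above, the Seq-interval values are 0-based (so 1 is added for the 1-based seq_start/seq_stop parameters), and the accession is passed without its version. The function name and the regular expression are illustrative only, and nothing is downloaded here.

```python
import re
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Matches lines such as "(Reference assembly)NG_005905:92500-173688".
LINE_RE = re.compile(
    r"^\((?P<heading>.*)\)(?P<acc>[A-Z]{2}_\d+):(?P<start>\d+)-(?P<stop>\d+)$"
)


def efetch_fasta_url(line):
    """Build an efetch URL for the region described by one output line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError("unexpected line format: %r" % line)
    params = {
        "db": "nuccore",
        "id": m["acc"],
        "rettype": "fasta",
        "retmode": "text",
        # Assumption: the XML coordinates are 0-based, efetch wants 1-based.
        "seq_start": int(m["start"]) + 1,
        "seq_stop": int(m["stop"]) + 1,
    }
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(efetch_fasta_url("(Reference assembly)NG_005905:92500-173688"))
```

It is worth checking the first few bases of the returned sequence against the gene record before relying on the offset.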

 

 

 

 

 


Hah, my initial reaction was "what sorcery is this?" Neat, I'd never heard of xslt before. Thanks!
 

Reply written 4.4 years ago by Nancy Ouyang