Question: Entrez Direct E-utilities - using match and xtract to filter by data value
al-ash (Japan/Okinawa/OIST) wrote, 3.1 years ago:

I'm trying to use Entrez Direct to extract the "Gene-commentary_accession" value from an XML file using:

esearch -db gene -query XP_003399880.1| efetch -format xml | xtract -pattern Gene-commentary  -match Gene-commentary_type:1 -element Gene-commentary_accession > Bter_FAR_genome_shotgun_sequences2.txt

An example of the XML (shortened):

  <Entrezgene_locus>
    <Gene-commentary>
      <Gene-commentary_type value="genomic">1</Gene-commentary_type>
      <Gene-commentary_heading>Reference Bter_1.0</Gene-commentary_heading>
      <Gene-commentary_label>Chromosome LG B12 Reference Bter_1.0</Gene-commentary_label>
      <Gene-commentary_accession>NC_015773</Gene-commentary_accession>
      <Gene-commentary_version>1</Gene-commentary_version>
      <Gene-commentary_seqs>
        <Seq-loc>
          <Seq-loc_int>
            <Seq-interval>
              <Seq-interval_from>7277254</Seq-interval_from>
              <Seq-interval_to>7286174</Seq-interval_to>
              <Seq-interval_strand>
                <Na-strand value="minus"/>
              </Seq-interval_strand>
              <Seq-interval_id>
                <Seq-id>
                  <Seq-id_gi>339751241</Seq-id_gi>
                </Seq-id>
              </Seq-interval_id>
            </Seq-interval>
          </Seq-loc_int>
        </Seq-loc>
      </Gene-commentary_seqs>
      <Gene-commentary_products>
        <Gene-commentary>
          <Gene-commentary_type value="mRNA">3</Gene-commentary_type>
          <Gene-commentary_heading>Reference</Gene-commentary_heading>
          <Gene-commentary_label>transcript variant X1</Gene-commentary_label>
          <Gene-commentary_accession>XM_003399832</Gene-commentary_accession>
          <Gene-commentary_version>2</Gene-commentary_version>
          <Gene-commentary_genomic-coords>

I'd like to retrieve only the genomic accession using the -match command, but I keep extracting other Gene-commentary_accession values as well, such as the one for "mRNA". Could you help me with the correct syntax?

(I find the use of -match quite difficult to understand from NCBI's documentation on this topic (https://www.ncbi.nlm.nih.gov/books/NBK179288/), so another example on Biostars might also help others with a similar question.)

DCGenomics (United States) wrote, 3.0 years ago:

Xtract 5.50, part of today's EDirect release, has better methods for handling recursive objects, with two specific improvements:

1) Nested exploration (e.g., "*/Gene-commentary") masks deeper objects from being seen by the -element selection command. It is no longer necessary to use -first instead of -element to exclude information from lower levels.

2) Recursive exploration (e.g., "**/Gene-commentary") flattens the recursive structure, visiting every indicated object regardless of depth. The same -element masking applies here.

In addition, the -match and -avoid commands, along with the "object:value" selection construct, have been deprecated, so that colon can be used to indicate namespace prefixes.

Conditional execution now uses -if and -unless commands, and has compound statements for string comparison (e.g., -contains) or numeric comparison (e.g., -lt).

Retrieving genomic accessions from Bombus terrestris can be done with:

   esearch -db gene -query XP_003399880.1 |
   efetch -format xml |
   xtract -pattern Entrezgene -block "**/Gene-commentary" \
     -if Gene-commentary_type@value -equals genomic \
       -tab "\n" -element Gene-commentary_accession |
   sort | uniq

This returns two accessions:

   AELG01001811
   NC_015773

Note that the efetch.fcgi "id" argument should have rejected a non-integer (accession) value sent to the gene database. This oversight has been reported to the program's maintainers. EDirect's efetch front-end now issues an error message if an accession is passed to -id and the -db argument is not a sequence database.

Please update to the latest version of EDirect by rerunning the download instructions in:

https://www.ncbi.nlm.nih.gov/books/NBK179288/

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 3.1 years ago:

> I'd like to retrieve only the genomic accession using the -match command, but I keep extracting other Gene-commentary_accession values as well, such as the one for "mRNA". Could you help me with the correct syntax?

Yes, because a Gene-commentary element can contain nested Gene-commentary elements with other Gene-commentary_type values.
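
The effect of that nesting can be reproduced with a small standard-library Python sketch. The XML below is a trimmed, hypothetical version of the record in the question, and both helper names are invented for illustration; the point is only that collecting every descendant accession of the outer element also picks up the mRNA one.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample based on the XML shown in the question.
XML = """
<Entrezgene_locus>
  <Gene-commentary>
    <Gene-commentary_type value="genomic">1</Gene-commentary_type>
    <Gene-commentary_accession>NC_015773</Gene-commentary_accession>
    <Gene-commentary_products>
      <Gene-commentary>
        <Gene-commentary_type value="mRNA">3</Gene-commentary_type>
        <Gene-commentary_accession>XM_003399832</Gene-commentary_accession>
      </Gene-commentary>
    </Gene-commentary_products>
  </Gene-commentary>
</Entrezgene_locus>
"""


def all_descendant_accessions(element):
    """Accessions anywhere below the element, including nested commentaries."""
    return [a.text for a in element.iter("Gene-commentary_accession")]


def own_accessions(element):
    """Accessions that are direct children of the element only."""
    return [a.text for a in element.findall("Gene-commentary_accession")]


root = ET.fromstring(XML)
outer = root.find("Gene-commentary")  # the outer, genomic commentary

print(all_descendant_accessions(outer))  # ['NC_015773', 'XM_003399832']
print(own_accessions(outer))             # ['NC_015773']
```

Restricting the lookup to the element's own children is what separates the genomic accession from those of its nested products.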

Use an XSLT stylesheet or an XPath expression instead of EDirect, for example:

$ curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=XP_003399880.1&retmode=xml&rettype=db" | xmllint --xpath '//Gene-commentary[Gene-commentary_type/text()="1" ]/Gene-commentary_accession' - | cat | tr "<>" "\n" | grep -vF 'Gene-commentary_accession' | grep -v '^$' | sort | uniq
AC_000062
AC_000151
AC010642
AC012313
AMYH02036533
CH471135
CP000040
NC_000019
NC_007103
NC_018930

(Ugly; an XSLT stylesheet would be better.)
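
As a possible alternative to post-processing the xmllint output with tr and grep, the same kind of XPath predicate can be evaluated with Python's standard xml.etree.ElementTree, which supports a limited XPath subset. This is a sketch, not the method used above: the XML is a trimmed, hypothetical sample based on the question, and accessions_by_type is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample based on the XML shown in the question.
XML = """
<Entrezgene>
  <Entrezgene_locus>
    <Gene-commentary>
      <Gene-commentary_type value="genomic">1</Gene-commentary_type>
      <Gene-commentary_accession>NC_015773</Gene-commentary_accession>
      <Gene-commentary_products>
        <Gene-commentary>
          <Gene-commentary_type value="mRNA">3</Gene-commentary_type>
          <Gene-commentary_accession>XM_003399832</Gene-commentary_accession>
        </Gene-commentary>
      </Gene-commentary_products>
    </Gene-commentary>
  </Entrezgene_locus>
</Entrezgene>
"""


def accessions_by_type(xml_text, type_code):
    """Return sorted unique accessions of Gene-commentary elements whose
    Gene-commentary_type child has the given text (e.g. "1" for genomic)."""
    root = ET.fromstring(xml_text)
    # [child='text'] keeps elements having a child with that exact text;
    # the trailing step then selects the accession child of each match.
    path = (
        f".//Gene-commentary[Gene-commentary_type='{type_code}']"
        "/Gene-commentary_accession"
    )
    return sorted({node.text for node in root.findall(path)})


print(accessions_by_type(XML, "1"))  # ['NC_015773']
print(accessions_by_type(XML, "3"))  # ['XM_003399832']
```

Because the element text is read directly, no tag stripping is needed, and the set takes the place of sort | uniq.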


Pierre, thanks for suggesting an alternative solution.

I'm aware of the multiple Gene-commentary types, which is why I tried to select one with -match Gene-commentary_type:1, but apparently I don't have the syntax right. I still hope to get it working, because otherwise I'm satisfied with EDirect and it works for my other tasks.

By the way, I'm wondering what is going on with the link in your code, https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=XP_003399880.1&retmode=xml&rettype=db. Why does it lead to the XML of a human protein when XP_003399880.1 is a completely different insect protein (https://www.ncbi.nlm.nih.gov/protein/XP_003399880.1)?

Comment written 3.1 years ago by al-ash