NCBI: retrieve only selected fields from an Entrez efetch query
4.3 years ago

I am currently retrieving information about sequences through the NCBI Entrez API. The URL looks like this: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=JVEU01000013,HQ844023.1&rettype=gb&retmode=xml

and the output looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<GBSet>
   <GBSeq>
      <GBSeq_locus>JVEU01000013</GBSeq_locus>
      <GBSeq_length>5266</GBSeq_length>
      <GBSeq_strandedness>double</GBSeq_strandedness>
      <GBSeq_moltype>DNA</GBSeq_moltype>
      <GBSeq_topology>linear</GBSeq_topology>
      <GBSeq_division>BCT</GBSeq_division>
      <GBSeq_update-date>10-JUL-2015</GBSeq_update-date>
      <GBSeq_create-date>10-JUL-2015</GBSeq_create-date>
      <GBSeq_definition>Stenotrophomonas maltophilia strain 498_SMAL 1015_5266_269573_11+,1127+,970+, whole genome shotgun sequence</GBSeq_definition>
      <GBSeq_primary-accession>JVEU01000013</GBSeq_primary-accession>
      <GBSeq_accession-version>JVEU01000013.1</GBSeq_accession-version>
      <GBSeq_other-seqids>
         <GBSeqid>gb|JVEU01000013.1|</GBSeqid>
         <GBSeqid>gnl|WGS:JVEU01|1015_5266_269573_11+,11&gt;</GBSeqid>
         <GBSeqid>gi|876108632</GBSeqid>
      </GBSeq_other-seqids>
      <GBSeq_project>PRJNA267549</GBSeq_project>
      <GBSeq_keywords>
         <GBKeyword>WGS</GBKeyword>
      </GBSeq_keywords>
      <GBSeq_source>Stenotrophomonas maltophilia</GBSeq_source>
      <GBSeq_organism>Stenotrophomonas maltophilia</GBSeq_organism>
      <GBSeq_taxonomy>Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales; Xanthomonadaceae; Stenotrophomonas; Stenotrophomonas maltophilia group</GBSeq_taxonomy>
      <GBSeq_references>
         <GBReference>
            <GBReference_authors>
               <GBAuthor>Roach,D.J.</GBAuthor>
            </GBReference_authors>
            <GBReference_title>A Year of Infection in the Intensive Care Unit: Prospective Whole Genome Sequencing of Bacterial Clinical Isolates Reveals Cryptic Transmissions and Novel Microbiota</GBReference_title>
            <GBReference_journal>PLoS Genet. 11 (7), E1005413 (2015)</GBReference_journal>
      </GBSeq_references>
      <GBSeq_comment>Source DNA available from Steve Salipante, University of Washington Department of Laboratory Medicine, Box 357110, 1959 NE Pacific Street, NW120 Seattle, WA 98195-7110; ##Genome-Assembly-Data-START## Assembly Method ABYSS v. 1.3.5 Genome Coverage 21x Sequencing Technology Illumina HiSeq ##Genome-Assembly-Data-END##</GBSeq_comment>
      <GBSeq_feature-table>
         <GBFeature>
            <GBFeature_key>source</GBFeature_key>
            <GBFeature_location>1..5266</GBFeature_location>
            <GBFeature_intervals>
               <GBInterval>
                  <GBInterval_from>1</GBInterval_from>
                  <GBInterval_to>5266</GBInterval_to>
                  <GBInterval_accession>JVEU01000013.1</GBInterval_accession>
               </GBInterval>
            </GBFeature_intervals>
            <GBFeature_quals>
               <GBQualifier>
         [...]
               </GBQualifier>
            </GBFeature_quals>
         </GBFeature>
      </GBSeq_feature-table>
      <GBSeq_sequence>g[...]catcccgaactcggaa</GBSeq_sequence>
      <GBSeq_xrefs>
         <GBXref>
      </GBSeq_xrefs>
   </GBSeq>
   <GBSeq>
      <GBSeq_locus>HQ844023</GBSeq_locus>
      <GBSeq_length>942</GBSeq_length>
      <GBSeq_strandedness>single</GBSeq_strandedness>
      <GBSeq_moltype>RNA</GBSeq_moltype>
      <GBSeq_topology>linear</GBSeq_topology>
      <GBSeq_division>VRL</GBSeq_division>
      <GBSeq_update-date>01-AUG-2011</GBSeq_update-date>
      <GBSeq_create-date>01-AUG-2011</GBSeq_create-date>
      <GBSeq_definition>Rotavirus A HC91xUK reassortant (UKg9KC-1) NSP3 protein gene, complete cds</GBSeq_definition>
      <GBSeq_primary-accession>HQ844023</GBSeq_primary-accession>
      <GBSeq_accession-version>HQ844023.1</GBSeq_accession-version>
      <GBSeq_other-seqids>
         <GBSeqid>gb|HQ844023.1|</GBSeqid>
         <GBSeqid>gi|341832806</GBSeqid>
      </GBSeq_other-seqids>
      <GBSeq_source>Rotavirus A HC91xUK reassortant (UKg9KC-1)</GBSeq_source>
      <GBSeq_organism>Rotavirus A HC91xUK reassortant (UKg9KC-1)</GBSeq_organism>
      <GBSeq_taxonomy>Viruses; dsRNA viruses; Reoviridae; Sedoreovirinae; Rotavirus; Rotavirus A</GBSeq_taxonomy>
      <GBSeq_references>
         <GBReference>
            <GBReference_reference>1</GBReference_reference>
            <GBReference_position>1..942</GBReference_position>
            <GBReference_authors>
               <GBAuthor>Rippinger,C.M.</GBAuthor>
            </GBReference_authors>
            <GBReference_title>Genome sequences of the NIH UK-bovine reassortant vaccine components</GBReference_title>
            <GBReference_journal>Unpublished</GBReference_journal>
         </GBReference>
         <GBReference>
         </GBReference>
      </GBSeq_references>
      <GBSeq_feature-table>
         <GBFeature>
            <GBFeature_key>source</GBFeature_key>
            <GBFeature_location>1..942</GBFeature_location>
            <GBFeature_intervals>
               <GBInterval>
               </GBInterval>
            </GBFeature_intervals>
            <GBFeature_quals>
               <GBQualifier>
             [...]
               </GBQualifier>
            </GBFeature_quals>
         </GBFeature>
         <GBFeature>
            <GBFeature_key>CDS</GBFeature_key>
         [...]
            </GBFeature_quals>
         </GBFeature>
      </GBSeq_feature-table>
      <GBSeq_sequence>atgct[...]tgaatag</GBSeq_sequence>
   </GBSeq>
</GBSet>

I would like to retrieve only GBSeq_accession-version, GBSeq_moltype, GBSeq_topology, GBSeq_organism and GBSeq_taxonomy, so that the output would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<GBSet>
   <GBSeq>
      <GBSeq_moltype>DNA</GBSeq_moltype>
      <GBSeq_topology>linear</GBSeq_topology>
      <GBSeq_accession-version>JVEU01000013.1</GBSeq_accession-version>
      <GBSeq_organism>Stenotrophomonas maltophilia</GBSeq_organism>
      <GBSeq_taxonomy>Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales; Xanthomonadaceae; Stenotrophomonas; Stenotrophomonas maltophilia group</GBSeq_taxonomy>
  </GBSeq>
  <GBSeq>
   [...]
  </GBSeq>
</GBSet>

Is there any way to specify, in the Entrez query itself, which fields should be returned?

sequence entrez • 1.4k views
4.3 years ago
Sej Modha 4.8k

You can extract all of these fields except the taxonomy lineage with the following command, using the command-line E-utilities (EDirect):

esearch -db nucleotide -query JVEU01000013 | esummary | xtract -pattern DocumentSummary -element AccessionVersion,MolType,Topology,Organism

For the taxonomy lineage, you can run a separate command:

elink -target taxonomy -db nuccore -id "JVEU01000013" | efetch -format xml | xtract -pattern Taxon -first Lineage
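
If EDirect is not an option, another approach is to keep the original efetch URL and drop the unwanted elements on the client side. As far as I know, efetch has no parameter for selecting individual fields, so the filtering has to happen after the download. Below is a minimal Python sketch using only the standard library. The function names `fetch_genbank_xml` and `filter_gbset` and the `WANTED` tuple are made up for this illustration; the element names come from the output shown in the question.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Fields requested in the question, in the order they should be written out.
WANTED = (
    "GBSeq_moltype",
    "GBSeq_topology",
    "GBSeq_accession-version",
    "GBSeq_organism",
    "GBSeq_taxonomy",
)


def fetch_genbank_xml(ids):
    """Download GenBank XML for the given accessions (requires network access)."""
    params = urllib.parse.urlencode(
        {"db": "nuccore", "id": ",".join(ids), "rettype": "gb", "retmode": "xml"}
    )
    with urllib.request.urlopen(f"{EFETCH_URL}?{params}") as response:
        return response.read()


def filter_gbset(xml_text, wanted=WANTED):
    """Build a new GBSet whose GBSeq records keep only the `wanted` child elements."""
    source = ET.fromstring(xml_text)
    result = ET.Element("GBSet")
    for record in source.findall("GBSeq"):
        slim = ET.SubElement(result, "GBSeq")
        for tag in wanted:
            node = record.find(tag)
            if node is not None:  # a record may lack some of the fields
                ET.SubElement(slim, tag).text = node.text
    return result
```

To use it, something like `ET.tostring(filter_gbset(fetch_genbank_xml(["JVEU01000013", "HQ844023.1"])), encoding="unicode")` should give a reduced GBSet. Note that the full records, including the sequences, are still downloaded, so this only trims the output and does not reduce traffic.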