Question: Difference between NCBI Entrez formats (e.g., xml, docsum) for efetch
paulparsons wrote (3.6 years ago):

I am using the NCBI EDirect UNIX command-line tools (http://www.ncbi.nlm.nih.gov/books/NBK179288/) to query the gene database and retrieve some basic information about the results (e.g., chromosome location, description, gene name). I find the documentation obscure and confusing (maybe because I don't have a bioinformatics background). After trying the different formats, I found that the docsum format seems to suit my needs best. However, since I reached this conclusion through trial and error, I still do not know whether it is really the right choice, nor do I understand the difference between the formats available for efetch. For example, what is the difference between the xml and docsum formats? Why and when should one use each?

Although I can retrieve the format outlines by doing the following:

 

esearch -db gene -query "brca1 [ALL]human[ORGN]" | \
  efetch -format docsum | \
  xtract -outline

and

esearch -db gene -query "brca1 [ALL]human[ORGN]" | \
  efetch -format xml | \
  xtract -outline

 

this gives only the syntax of the formats, not the semantics. It does not help me understand when or why someone might prefer one format over the other. The formats are obviously different, but without a background in this area I cannot infer the semantics from the syntax alone.
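For context, pulling the fields mentioned above (gene name, description, chromosome location) out of the docsum might look like the following sketch. The function name gene_summary_table is hypothetical, and the element names Name, Description, Chromosome and MapLocation are assumptions about the gene docsum that should be checked against the xtract -outline output before relying on them.

```sh
# Hypothetical helper: print one tab-separated row per gene record.
# Assumes EDirect (esearch, efetch, xtract) is installed and on the PATH.
# The element names below are assumptions; confirm them with xtract -outline.
gene_summary_table() {
  esearch -db gene -query "$1" |
    efetch -format docsum |
    xtract -pattern DocumentSummary \
      -element Name Description Chromosome MapLocation
}

# Example call (requires network access):
#   gene_summary_table "brca1 [ALL]human[ORGN]"
```

Each -element argument names one field of the DocumentSummary record, so adding or removing a column only means editing that list.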

To make things more confusing, the documentation seems to state that the 'docsum' format is not actually a 'specified format':

Records can be retrieved in specified formats or as document summaries:

  • efetch downloads records or reports in a designated format.

 

Moreover, the naming of the formats suggests that docsum is a summary of the full xml document. However, there seem to be certain fields in the docsum that are not in the xml. As I don't have a biology or genomics background, I can't tell whether certain terms refer to different things or are simply synonymous.

The answer seems like it should be so simple, yet I can't find it anywhere! Any help is much appreciated.

 


I understand that the fields in the 'docsum' format are organised differently, and do not necessarily follow the XML syntax. If you save the output of efetch to a file before piping it to xtract, you will see that the xml output takes up far more space. Which fields did you find in the docsum but not in the XML? The EDirect tools seem to be very handy and powerful, but I agree that the documentation is not yet complete.

(comment written 3.6 years ago by Zag)

paulparsons wrote (3.6 years ago):

It turns out that, with EDirect, efetch -format docsum is equivalent to the E-utilities esummary call, whereas efetch -format xml is equivalent to the E-utilities efetch call.

Here's an answer I received from NCBI:

The edirect efetch is a wrapper around a combination of two eutils fcgis: efetch.fcgi and esummary.fcgi. Basically, the edirect efetch -format docsum is the same as the eutils esummary.fcgi.

Edirect:

efetch -db nuccore -id 6092233 -format docsum

…is the same as

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=nuccore&id=6092233&version=2.0

This is very different from

efetch -db nuccore -id 6092233 -format xml

…which is the same as

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=6092233&retmode=xml&rettype=gb
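
The correspondence described in the reply can be sketched as a small shell helper that prints the E-utilities URL paired with each EDirect format. The function name eutils_url is hypothetical, only the two formats quoted above are handled, and rettype=gb is simply copied from the nuccore example, so it may not be the right value for other databases.

```sh
# Hypothetical helper: print the E-utilities URL that the reply above pairs
# with "efetch -db DB -id ID -format FORMAT". Nothing is downloaded.
eutils_url() {
  db=$1
  id=$2
  format=$3
  base="http://eutils.ncbi.nlm.nih.gov/entrez/eutils"
  case "$format" in
    docsum)
      # docsum is served by esummary.fcgi
      printf '%s/esummary.fcgi?db=%s&id=%s&version=2.0\n' "$base" "$db" "$id"
      ;;
    xml)
      # xml is served by efetch.fcgi; rettype=gb comes from the nuccore
      # example and is an assumption for any other database
      printf '%s/efetch.fcgi?db=%s&id=%s&retmode=xml&rettype=gb\n' "$base" "$db" "$id"
      ;;
    *)
      echo "unsupported format: $format" >&2
      return 1
      ;;
  esac
}

eutils_url nuccore 6092233 docsum
eutils_url nuccore 6092233 xml
```

The two calls at the end print the two URLs quoted above, which makes it easy to open both in a browser and compare the responses side by side.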

 

This still doesn't give me a complete understanding of their differences, but it is helpful in suggesting where else to look for information. As Zag commented previously, it is possible to redirect the results to a file for easier investigation, by doing something like

esearch -db gene -query "brca1 [ALL]human[ORGN]" | \
  efetch -format docsum > results.out

and then comparing the docsum with the xml output saved the same way. The outlines can also be compared using the method from my original question.
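
Once both outputs are saved, one rough way to see which element names appear in one file but not the other is sketched below. The function names tag_names and tags_only_in_first and the file names in the usage comment are hypothetical. The sketch treats the files as plain text and matches opening tags with grep rather than using a real XML parser, and it assumes a grep that supports -o.

```sh
# List the distinct XML element names that occur in a file, one per line.
# Closing tags, comments and the <?xml ...?> declaration are not matched.
tag_names() {
  grep -o '<[A-Za-z_][A-Za-z0-9_.-]*' "$1" | tr -d '<' | LC_ALL=C sort -u
}

# Print the element names found in the first file but not in the second.
tags_only_in_first() {
  first_tags=$(mktemp)
  second_tags=$(mktemp)
  tag_names "$1" > "$first_tags"
  tag_names "$2" > "$second_tags"
  # comm -23 keeps only the lines unique to the first sorted input
  LC_ALL=C comm -23 "$first_tags" "$second_tags"
  rm -f "$first_tags" "$second_tags"
}

# Example call (hypothetical file names):
#   tags_only_in_first results_docsum.out results_xml.out
```

Swapping the two arguments lists the names that occur only in the xml output. A name that appears in both files can still mean different things in each, so this only narrows down where to look.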

- also of interest: release notes for EDirect Version 1.10


In the link provided by NCBI, for example, a big difference is that the XML has a field for the sequence. In your examples though (you are querying the gene database) this is not the case.

(comment written 3.6 years ago by Zag)

DCGenomics wrote (8 months ago):

Perhaps this will be useful to you:

https://github.com/NCBI-Hackathons/EDirectCookbook


Please post this as a new "tutorial" post. Would be helpful for many.

(comment written 8 months ago by genomax)