Question: Difference between NCBI Entrez formats (e.g., xml, docsum) for efetch
6.5 years ago, paulparsons wrote:

I am using the NCBI EDirect UNIX command-line tools to query the gene database and get some basic information about the results (e.g., chromosome location, description, gene name). The documentation seems obscure and confusing to me (maybe because I don't have a bioinformatics background). After playing with the different formats, I have found that the docsum format seems to suit my needs best. Although I reached this conclusion through trial and error, I still do not have a clear understanding of whether it is really true, nor of the difference between the possible formats for efetch. For example, what is the difference between the xml and docsum formats? Why and when should one use each?

Although I can retrieve the format outlines by doing the following:


esearch -db gene -query "brca1 [ALL]human[ORGN]" | \
  efetch -format docsum | \
  xtract -outline


esearch -db gene -query "brca1 [ALL]human[ORGN]" | \
  efetch -format xml | \
  xtract -outline


this gives only the syntax of the formats and not the semantics. It does not help me understand when or why someone might prefer one format over the other. Obviously they are different, but without a background in this area, it is impossible to infer the semantics from the syntax.

To make things more confusing, the documentation seems to state that the 'docsum' format is not actually a 'specified format':

Records can be retrieved in specified formats or as document summaries:

  • efetch downloads records or reports in a designated format.


Moreover, the names of the formats suggest that docsum is a summary of the full xml document. However, there seem to be certain fields in the docsum that are not in the xml. As I don't have a biology or genomics background, I can't tell whether certain terms refer to different things or are simply synonymous.

The answer seems like it should be so simple, yet I can't find it anywhere! Any help is much appreciated.


modified 3.5 years ago by DCGenomics • written 6.5 years ago by paulparsons

I understand that the fields in the 'docsum' format are organised differently, and do not necessarily follow the XML syntax. If you save the output of efetch to a file (that is, before piping it to xtract), you will see that the xml output takes up far more space. Which fields are in the docsum that you did not find in the XML? The EDirect tools seem to be very handy and powerful, but I agree that the documentation is not yet complete.
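
One way to make that size comparison concrete is sketched below. It assumes EDirect is installed and on the PATH. The helpers `fetch_formats` and `compare_sizes` and the file names `docsum.xml` and `full.xml` are made up for illustration and are not part of EDirect.

```shell
#!/bin/sh

# Hypothetical helper: save both efetch outputs for one gene query.
# Needs EDirect and network access; the output file names are placeholders.
fetch_formats() {
  query=$1
  esearch -db gene -query "$query" | efetch -format docsum > docsum.xml
  esearch -db gene -query "$query" | efetch -format xml > full.xml
}

# Hypothetical helper: print the size in bytes of each file given.
compare_sizes() {
  for f in "$@"; do
    # Reading from stdin makes wc print only the count, without the file name.
    size=$(wc -c < "$f" | tr -d '[:space:]')
    printf '%s\t%s bytes\n' "$f" "$size"
  done
}
```

Running `fetch_formats "brca1 [ALL]human[ORGN]"` followed by `compare_sizes docsum.xml full.xml` should then print one line per file, which is enough to see how much larger the full record is.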

modified 10 months ago by RamRS • written 6.5 years ago by Zag
6.5 years ago, paulparsons wrote:

It turns out that with EDirect, efetch -format docsum is the same as the E-utilities esummary, whereas efetch -format xml is the same as the E-utilities efetch.

Here's an answer I received from NCBI:

The edirect efetch is wrapper to a combination of two eutils fcgis; efetch.fcgi and esummary.fcgi. Basically the edirect efetch -docsum is the same as the eutils esummary.fcgi


efetch -db nuccore -id 6092233 -format docsum the same as

This is very different from

efetch -db nuccore -id 6092233 -format xml

...which is the same as

This still doesn't give me a complete understanding of their differences, but it is helpful in suggesting where else to look for information. As Zag commented previously, it is possible to redirect the results to a file for easier investigation, by doing something like

esearch -db gene -query "brca1 [ALL]human[ORGN]" | \
  efetch -format docsum > results.out

and comparing the docsum with the xml. Also, the outlines can be compared using the original method I mentioned.
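
Going back to the nuccore example in NCBI's reply: the links did not survive in this post, but the two kinds of request can be sketched as follows, assuming the standard public E-utilities base URL. `eutils_url` is a hypothetical helper, and the exact parameters that EDirect itself sends (such as `retmode` or `rettype`) are an assumption here, so treat the output as an approximation rather than what efetch does internally.

```shell
#!/bin/sh

# Public E-utilities base URL (assumed; the original links were lost).
EUTILS_BASE="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Hypothetical helper, not part of EDirect.
# Usage: eutils_url FORMAT DB ID
# Prints the E-utilities URL that roughly corresponds to
#   efetch -db DB -id ID -format FORMAT
eutils_url() {
  format=$1
  db=$2
  id=$3
  case "$format" in
    # docsum requests go to esummary.fcgi
    docsum) printf '%s/esummary.fcgi?db=%s&id=%s\n' "$EUTILS_BASE" "$db" "$id" ;;
    # full records go to efetch.fcgi (retmode=xml is an assumption)
    xml)    printf '%s/efetch.fcgi?db=%s&id=%s&retmode=xml\n' "$EUTILS_BASE" "$db" "$id" ;;
    *)      echo "unsupported format: $format" >&2; return 1 ;;
  esac
}

eutils_url docsum nuccore 6092233
eutils_url xml nuccore 6092233
```

Opening the two printed URLs in a browser, or fetching them with curl, is a quick way to see the summary and the full record side by side without going through EDirect.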

modified 10 months ago by RamRS • written 6.5 years ago by paulparsons

In the link provided by NCBI, for example, a big difference is that the XML has a field for the sequence. In your examples though (you are querying the gene database) this is not the case.

written 6.5 years ago by Zag
3.5 years ago, DCGenomics (United States) wrote:

Perhaps this will be useful to you:

written 3.5 years ago by DCGenomics

Please post this as a new "tutorial" post. Would be helpful for many.

written 3.5 years ago by genomax