XML Returned Using EFetch Differs From That Downloaded From A Query At The PubMed Website
11.8 years ago
Neilfws 49k

If I go to PubMed and enter this query, which currently returns around 2290 results:

"Retraction of Publication"[Publication Type]

then select "Send to File", format = XML, "Create File", the download generally takes a few seconds and returns a file with only one DOCTYPE line, as expected:

grep -c DOCTYPE ~/Downloads/pubmed_result.xml
# 1
grep DOCTYPE ~/Downloads/pubmed_result.xml
# http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_120101.dtd">

If I perform an equivalent query using the BioRuby implementation of EUtils:

require "rubygems"
require "bio"

ncbi = Bio::NCBI::REST.new
Bio::NCBI.default_email = "me@me.com"

# count the matching records, fetch all of their UIDs, then fetch the full records as XML
retmax = ncbi.esearch_count("Retraction of Publication[ptyp]", {"db" => "pubmed"})
search = ncbi.esearch("Retraction of Publication[ptyp]", {"db" => "pubmed", "retmax" => retmax})
result = ncbi.efetch(search, {"db" => "pubmed", "retmode" => "xml"})

File.open("pubmed_result.xml", "w") do |f|
  f.write(result)
end

it takes significantly longer to return the XML, the resulting file is slightly different in size, and it contains multiple DOCTYPE lines, which breaks XML parsing:

grep -c DOCTYPE pubmed_result.xml
# 23

It appears that EFetch returns separate, complete XML "chunks" and concatenates them into one file. This does not occur if a smaller subset of the variable search, e.g. search[0..4], is passed to efetch; a sketch for splitting the concatenated output back into parseable documents follows the questions below. So:

  1. Is this issue due to passing too many IDs to efetch?
  2. Have other people observed it using other implementations of EUtils?
  3. Can it be resolved using e.g. a POST query, as suggested in the EUtils documentation?
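
For reference, the concatenated output can be split back into individually parseable documents before parsing. Here is a minimal sketch, assuming each chunk starts with its own <?xml declaration (which the multiple DOCTYPE lines suggest) and using the file written by the script above:

require "rexml/document"

# Read the concatenated efetch output and split it into individual XML
# documents, assuming each chunk begins with its own XML declaration.
raw    = File.read("pubmed_result.xml")
chunks = raw.split(/(?=<\?xml )/).reject { |c| c.strip.empty? }
puts "#{chunks.size} chunks"

# Each chunk should now parse on its own; collect all PubmedArticle nodes.
articles = chunks.flat_map do |chunk|
  doc = REXML::Document.new(chunk)
  REXML::XPath.match(doc, "//PubmedArticle")
end
puts "#{articles.size} PubmedArticle records in total"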
pubmed eutils xml • 6.8k views

On further investigation: I forgot to pass retmax as a parameter to efetch. However, even when this is done, results seem to be returned in batches of 100. This does not occur when the BioPerl EUtils library is used, so it may be a bug in (my version of?) BioRuby.


On even further investigation: it seems that adding the parameter "step = retmax" to efetch solves the problem.

11.8 years ago

Here is a thread with similar problems: http://comments.gmane.org/gmane.comp.python.bio.general/6962

EFetch doesn't seem to cope well with fetching large numbers of records at once. It looks like, with your query, EFetch is breaking the result up into chunks of 1,000 records. Maybe they've fixed it since then so that it automatically breaks queries up into chunks of 1,000 when the number of returned records is large?

It is interesting though that their web implementation doesn't have this problem. You would assume they use the same EFetch system for their web interface...


Seems more like chunks of ~ 100 records? 23 DOCTYPE lines for 2290 records.

11.8 years ago
Neilfws 49k

To answer my own question: efetch in BioRuby can take a third parameter, step, which sets the maximum number of records to retrieve at one time (its default value presumably explains the batches of 100 observed above).

So this line works in my code:

result = ncbi.efetch(search, {"db" => "pubmed", "retmode" => "xml"}, step = retmax)

Documented in the BioRuby API documentation for Bio::NCBI::REST#efetch.

11.8 years ago
Chris Maloney ▴ 360

I can add some evidence to support the idea that it is the Ruby library not properly aggregating chunked results, and not a problem on the NCBI end. You can try this query out manually from the web interface.
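
The exact URL from the original post is not preserved here. As an illustration only, a manual request along these lines, using the documented esearch/efetch CGI parameters together with the history server (WebEnv/query_key), fetches the whole result set in a single efetch call; this is a sketch with an illustrative output file name, not the poster's original link:

require "open-uri"
require "cgi"

base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
term = CGI.escape('"Retraction of Publication"[Publication Type]')

# Post the search to the history server, then pull back WebEnv and query_key.
search_xml = URI.open("#{base}/esearch.fcgi?db=pubmed&term=#{term}&usehistory=y").read
webenv     = search_xml[%r{<WebEnv>([^<]+)</WebEnv>}, 1]
query_key  = search_xml[%r{<QueryKey>([^<]+)</QueryKey>}, 1]

# Fetch the entire result set as XML in one request via the history server.
fetch_url = "#{base}/efetch.fcgi?db=pubmed&query_key=#{query_key}" \
            "&WebEnv=#{CGI.escape(webenv)}&retmode=xml"
File.write("pubmed_manual.xml", URI.open(fetch_url).read)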

It does take significantly longer to download the XML this way, as you mentioned; so that's strange. But the resultant document, when it finally does arrive, only has one doctype declaration.

11.8 years ago
wdiwdi ▴ 380

EFetch does indeed have problems when the returned data size gets too large (it is the total data size that matters, not the record count); beyond that point there may be timeouts between various internal components of the EUtils system, which can lead to incomplete or corrupted results. For queries returning only UIDs in XML format, my experience is that the safe limit is about 2.5 million records. So it is better to split the retrieval into chunks.
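
In BioRuby terms, that advice might look something like the sketch below: fetch the UIDs once, then efetch them in explicit batches and write each batch to its own, individually valid, XML file. The batch size of 500 and the output file names are arbitrary illustrative choices; the third efetch argument is the step parameter discussed above:

require "bio"

Bio::NCBI.default_email = "me@me.com"
ncbi = Bio::NCBI::REST.new

# Get all matching UIDs, then fetch the records in modest batches so that
# no single efetch response grows too large.
count = ncbi.esearch_count("Retraction of Publication[ptyp]", {"db" => "pubmed"})
ids   = ncbi.esearch("Retraction of Publication[ptyp]", {"db" => "pubmed", "retmax" => count})

ids.each_slice(500).with_index do |batch, i|
  xml = ncbi.efetch(batch, {"db" => "pubmed", "retmode" => "xml"}, batch.size)
  File.write(format("pubmed_result_%03d.xml", i), xml)
end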


The issue here is not incomplete or corrupt data, just how the chunks are returned.


Yes. That was supposed to be a comment on the first and second answers.


I confirm this.
