Entrez epost + elink returns results out of order with Biopython
9.6 years ago
Chris F. ▴ 20

I ran into this today and wanted to toss it out there. It appears that, using the Biopython interface to Entrez at NCBI, it's not possible to get results back (at least from elink) in the same order as the input. Please see the code below for an example. I have thousands of GIs for which I need to get taxonomy information, and querying them individually is painfully slow due to NCBI restrictions.

from Bio import Entrez
Entrez.email = "my@email.com"
ids = ["148908191", "297793721", "48525513", "507118461"]

# Batch: post all IDs at once, then link from the History server
search_results = Entrez.read(Entrez.epost("protein", id=','.join(ids)))
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
print(Entrez.read(Entrez.elink(webenv=webenv,
                               query_key=query_key,
                               dbfrom="protein",
                               db="taxonomy")))

print("-------")

# One at a time: post and link each ID separately
for i in ids:
    search_results = Entrez.read(Entrez.epost("protein", id=i))
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]
    print(Entrez.read(Entrez.elink(webenv=webenv,
                                   query_key=query_key,
                                   dbfrom="protein",
                                   db="taxonomy")))

Results:

[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '211604'}, {u'Id': '81972'}, {u'Id': '32630'}, {u'Id': '3332'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['148908191', '297793721', '48525513', '507118461'], u'LinkSetDbHistory': [], u'ERROR': []}]
-------
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '3332'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['148908191'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '81972'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['297793721'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '211604'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['48525513'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '32630'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['507118461'], u'LinkSetDbHistory': [], u'ERROR': []}]

The elink documentation at NCBI says this should be possible, but only by passing multiple 'id=' parameters, which doesn't appear to be possible with the Biopython epost interface. Has anyone else seen this, or am I missing something obvious?

Thanks!

Note: this is a cross-post from StackOverflow at https://stackoverflow.com/questions/25775309/entrez-epost-elink-returns-results-out-of-order-with-biopython

python ncbi biopython • 6.4k views
9.6 years ago

It seems that passing identical parameter names multiple times is not possible in Biopython's epost, since it passes them as a dictionary.

On the other hand, the Entrez interface is a very thin layer over the eutils URLs that it accesses. It is very easy to build your own URL that populates the parameters correctly. You could use a library like requests (http://docs.python-requests.org/en/latest/) to make it super simple.

If that does not work, then reordering the results is the next workaround: put your results into a dictionary keyed by the ID, then iterate over the original IDs and pull the values from the dictionary. That should be no problem for data sizes of tens of thousands.
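
A minimal sketch of that reordering workaround, assuming you have already collected (protein ID, taxonomy ID) pairs in whatever order they arrived. The helper name `reorder_by_input` is hypothetical; the IDs are the ones from the question.

```python
def reorder_by_input(input_ids, pairs):
    """Return one result per input ID, in input order (None if missing)."""
    # Key the results by their source ID so lookup no longer depends on
    # the order in which they arrived.
    by_id = dict(pairs)
    return [by_id.get(i) for i in input_ids]


ids = ["148908191", "297793721", "48525513", "507118461"]
# Pairs deliberately out of order, as if they had come back shuffled.
shuffled = [("48525513", "211604"), ("507118461", "32630"),
            ("148908191", "3332"), ("297793721", "81972")]

ordered = reorder_by_input(ids, shuffled)
print(ordered)
```

Note that this only helps when each result still carries its input ID. The single grouped LinkSet shown in the question does not, so the pairing has to be preserved at query time.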


There are no identical URL parameter names - the ID parameter is held as a single string (comma separated), so where do you think the dictionary step (and loss of order) happens?


What the OP says the EUtils documentation seems to recommend is that one can force a certain ordering by passing identical parameters, like so:

query?id=1&id=2&id=3 

In epost that does not seem to be possible, because the parameters are expected to be a dictionary:

param=dict(id=1)

But it would be possible (the default Python urlencode would support it) if the parameters were in the form of a list of tuples, like so:

[('id', 1), ('id',2), ('id', 3)]

I have not actually checked the statement about ordering for validity. I just looked at how this worked because I was interested in whether it is possible to pass identically named parameters, since that appears to be a corner case of some utility.
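
To make the urlencode point concrete, here is a small sketch using Python 3's `urllib.parse.urlencode` (in Python 2 the same function lives in `urllib`). It only builds query strings and says nothing about how NCBI orders its results.

```python
from urllib.parse import urlencode

# A dictionary holds each key only once, so all IDs end up in one value.
as_dict = urlencode({"id": "1,2,3"})

# A list of (key, value) tuples may repeat the key.
as_pairs = urlencode([("id", 1), ("id", 2), ("id", 3)])

# doseq=True expands a sequence value into one parameter per element.
as_seq = urlencode({"id": [1, 2, 3]}, doseq=True)

print(as_dict)
print(as_pairs)
print(as_seq)
```

The first form percent-encodes the commas inside a single id value, while the other two both yield a repeated id parameter.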


Where does the NCBI say the URL should repeat the id parameter like that? See http://www.ncbi.nlm.nih.gov/books/NBK25499/ which says for the elink id argument "UID list. Either a single UID or a comma-delimited list of UIDs may be provided. ..." and for the epost id argument "UID list. Either a single UID or a comma-delimited list of UIDs may be provided."

To me this says we should build the URL using .../epost.fcgi?id=id1,id2&db=... instead of .../epost.fcgi?id=id1&id=id2&db=... (which older Biopython code used to do, but the NCBI started giving an Error 500 here so we changed to the comma separated list as of https://github.com/biopython/biopython/commit/f18361653531b48282cb73d221550d42612fbba9).


From that page, under the ELink section:

If more than one id parameter is provided, ELink will perform a separate link operation for the set of UIDs specified by each id parameter. This effectively accomplishes "one-to-one" links and preserves the connection between the input and output UIDs.

Find one-to-one links from protein to gene.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=15718680&id=157427902&id=119703751


I find that documentation misleading. To me, the introduction says don't do http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=15718680&id=157427902&id=119703751 (which gives the one-to-one results) but instead do http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=15718680,157427902,119703751 (which gives the muddled results). This is confusing :(


Thanks, Istvan. Yeah, I'll probably just build my own query, but I wanted to make this visible so it could be explored.


Given the two different modes of elink with multiple links, would you prefer Biopython always built its URL with the repeated &id=... bits in order to get the one-to-one mapping?

Or something like if you give Biopython a comma separated string it uses that as is (single &id=... in the URL as now) but if you give a list of IDs it uses multiple &id=... in the URL to get one-to-one mappings?


Thanks, Peter. Probably not always, but it would be nice to have the option ;-)

FWIW, I was able to get around my 1:1 problem by building my own elink URLs (with requests), batching them, returning XML, and then parsing it with Entrez.read().
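
A rough sketch of the URL-building half of that approach, not the actual code used. The function name `build_elink_urls` and the batch size are assumptions for illustration; the endpoint is the elink.fcgi address quoted earlier in the thread, here over https.

```python
from urllib.parse import urlencode

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def build_elink_urls(ids, dbfrom="protein", db="taxonomy", batch_size=100):
    """Yield one ELink URL per batch, with a separate id= parameter per UID."""
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # A list of tuples lets the "id" key repeat, which a dict cannot do.
        params = [("dbfrom", dbfrom), ("db", db)]
        params.extend(("id", uid) for uid in batch)
        yield ELINK_URL + "?" + urlencode(params)


ids = ["148908191", "297793721", "48525513", "507118461"]
urls = list(build_elink_urls(ids, batch_size=2))
for url in urls:
    print(url)
```

Each URL would then be fetched (for example with requests) and the returned XML parsed with Entrez.read(), as described above. The batch size of 100 is arbitrary and not an NCBI recommendation.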


Issue filed with Biopython elink URL construction, https://github.com/biopython/biopython/issues/361

9.6 years ago
Peter 6.0k

Because Python functions can only take a named argument once, you cannot do epost(..., id=id1, id=id2, ...). Instead we expect you to use either a list, epost(..., id=my_id_list, ...), or, as in your example, a comma-separated string, epost(..., id=",".join(my_id_list), ...), which is what the code does internally if you use a list; see https://github.com/biopython/biopython/commit/f18361653531b48282cb73d221550d42612fbba9

As to the result order, that seems to be down to the NCBI. Print the raw XML and you get this:

<?xml version="1.0"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 23 November 2010//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_101123.dtd">
<eLinkResult>
<LinkSet>
    <DbFrom>protein</DbFrom>
    <IdList>
        <Id>148908191</Id>
        <Id>297793721</Id>
        <Id>48525513</Id>
        <Id>507118461</Id>
    </IdList>
    <LinkSetDb>
        <DbTo>taxonomy</DbTo>
        <LinkName>protein_taxonomy</LinkName>
        <Link>
            <Id>211604</Id>
        </Link>
        <Link>
            <Id>81972</Id>
        </Link>
        <Link>
            <Id>32630</Id>
        </Link>
        <Link>
            <Id>3332</Id>
        </Link>
    </LinkSetDb>
</LinkSet>
</eLinkResult>

You are hoping for 148908191 --> 3332, 297793721 --> 81972, 48525513 --> 211604 and 507118461 --> 32630 here?

Update: Issue filed with Biopython elink URL construction, https://github.com/biopython/biopython/issues/361


Yes, Peter, exactly. It appears the 1:1 mapping is lost when the GIs are submitted under a single "id="

9.4 years ago
Wayne ★ 2.0k

Looking into this recently while sorting out an approach to using ELink, I found that, like Peter said, the result is down to NCBI.

If you had not tried to play nice and use the Entrez History Server, it would have worked.

If you look at the information under ELink Considerations, you'll see that using a WebEnv and a query_key from the Entrez History server causes the links to be returned "as a group without information about which nucleotide record is linked to which protein record."

If you just skip the EPost step and send your list to ELink, it will work (as Peter discusses here and demos here).

Here is how you can keep the 1:1 correspondence:

from Bio import Entrez
Entrez.email = "A.N.Other@example.com"     # Always tell NCBI who you are. PUT YOUR EMAIL THERE.
protein_gi_numbers = ["148908191", "297793721", "48525513", "507118461"]
taxonomy_uids = []

#ELink step
handle = Entrez.elink(dbfrom="protein", db="taxonomy", id=protein_gi_numbers)
result = Entrez.read(handle)
handle.close()

#Mine the results
for each_record in result:
    taxonomy_id = each_record["LinkSetDb"][0]["Link"][0]["Id"]
    taxonomy_uids.append(taxonomy_id)

#Report
#print(result)
print(taxonomy_uids)

Result:

['3332', '81972', '211604', '32630']

(You can see the code above run live in a fully interactive in-browser IPython console window here.)
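
One caveat with the mining loop: the `[0]["Link"][0]` indexing assumes every record has at least one link. A more defensive sketch, assuming (not verified here) that an unlinked record comes back with an empty LinkSetDb list, could build an explicit input-to-output mapping instead. The function name `link_map` is hypothetical, and the sample records are trimmed copies of the per-ID output printed in the question.

```python
def link_map(records):
    """Map each input ID to its list of linked IDs, one record per input ID."""
    mapping = {}
    for record in records:
        # In one-to-one mode each record's IdList holds a single input ID.
        source_id = record["IdList"][0]
        linked = []
        for linkset in record["LinkSetDb"]:
            linked.extend(link["Id"] for link in linkset["Link"])
        mapping[source_id] = linked  # stays [] when nothing was linked
    return mapping


# Trimmed copies of the per-ID records printed in the question.
records = [
    {"IdList": ["148908191"], "LinkSetDb": [{"Link": [{"Id": "3332"}]}]},
    {"IdList": ["297793721"], "LinkSetDb": [{"Link": [{"Id": "81972"}]}]},
    {"IdList": ["48525513"], "LinkSetDb": [{"Link": [{"Id": "211604"}]}]},
    {"IdList": ["507118461"], "LinkSetDb": [{"Link": [{"Id": "32630"}]}]},
]
mapping = link_map(records)
print(mapping)
```

Keeping the mapping as a dictionary also means the output order no longer matters, since each taxonomy ID is looked up by its GI.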

My understanding from the Biopython Tutorial and Cookbook about the Entrez Guidelines is that Biopython enforces a limit of no more than three requests per second. However, if you are going to use ELink on over 100 UIDs, you have to do it outside of peak times. I assume each record in the list (called 'protein_gi_numbers' here) actually counts as an individual request? Maybe Peter can comment on this?


I am unsure if the NCBI consider each HTTP request one query, or if as you suggest it depends on the number of IDs requested as well. In either case, 100s of IDs would be best done outside their peak hours.
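
If you build your own URLs rather than going through Biopython, you have to space out the requests yourself. Below is a minimal throttling sketch, assuming the three-requests-per-second figure mentioned above; `run_throttled` is a hypothetical helper, and the tasks are stand-ins for real E-utilities calls.

```python
import time


def run_throttled(tasks, max_per_second=3):
    """Call each zero-argument task in turn, at most max_per_second per second."""
    min_interval = 1.0 / max_per_second
    results = []
    last_start = None
    for task in tasks:
        if last_start is not None:
            # Sleep only for whatever part of the interval has not yet elapsed.
            wait = min_interval - (time.monotonic() - last_start)
            if wait > 0:
                time.sleep(wait)
        last_start = time.monotonic()
        results.append(task())
    return results


# Stand-in tasks; in practice each would issue one E-utilities request.
batches = [["148908191", "297793721"], ["48525513", "507118461"]]
tasks = [lambda b=b: len(b) for b in batches]
results = run_throttled(tasks, max_per_second=50)
print(results)
```

This limits the number of HTTP requests only. Whether NCBI counts IDs within a request is, as noted above, unclear.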


For the record, I should also add that I found the NCBI taxonomy records to be lacking for some needs and resorted to using Python to invoke other APIs, namely The Global Biodiversity Information Facility's Taxon Web Service. I later thought I wish I had used the Integrated Taxonomic Information System. That one seemed good for my needs of assigning taxons to vertebrates, but as far as I know it is not yet on the whitelist of PythonAnywhere.com, which is where I do a lot of development. The old code is here.
