Question

Biopython Entrez.esearch always returns the same irrelevant Entrez id regardless of query

4.5 years ago

abel ▴ 10

Hi,

I tried to search the NCBI Gene database with Biopython's Entrez module for the following random terms: ["egfr","pi3k","puma"].

I was expecting to get the entries for EGFR, PI3K and PUMA. However, the IdList always had '7157' (TP53) as the first item for all 3 queries.

Before I ran this test, I searched for TP53 using esearch and it correctly returned 7157 as the first entry. However, every subsequent esearch query returned 7157 as the first element for every single query, even though TP53 is not related to any of the queries. Here is my code:

from Bio import Entrez

idlist = []
terms = ["egfr", "pi3k", "puma"]
for termd in terms:
    print("search for ", termd)
    info2 = Entrez.esearch(db="gene", term=termd)
    print("\n \n", info2)
    record2 = Entrez.read(info2)
    print(record2)
    idlist.append(record2["IdList"][0])
print(idlist)

and here is what the shell returned:

search for  egfr


 <_io.TextIOWrapper encoding='utf-8'>
{'Count': '6303', 'RetMax': '20', 'RetStart': '0', 'IdList': ['7157', '1956', '7124', '7422', '3569', '7040', '22059', '2064', '2099', '3091', '351', '672', '4318', '9370', '5243', '1401', '207', '367', '4790', '21898'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'egfr[All Fields]', 'Field': 'All Fields', 'Count': '6303', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': 'egfr[All Fields]'}
search for  pi3k


 <_io.TextIOWrapper encoding='utf-8'>
{'Count': '15392', 'RetMax': '20', 'RetStart': '0', 'IdList': ['7157', '1956', '7124', '7422', '3569', '7040', '22059', '2064', '2099', '3586', '3091', '351', '672', '4318', '9370', '5243', '1401', '207', '367', '4790'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'pi3k[All Fields]', 'Field': 'All Fields', 'Count': '15392', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': 'pi3k[All Fields]'}
search for  puma


 <_io.TextIOWrapper encoding='utf-8'>
{'Count': '24385', 'RetMax': '20', 'RetStart': '0', 'IdList': ['7157', '1956', '7124', '7040', '22059', '2064', '207', '21898', '6774', '3845', '1029', '4609', '2475', '21803', '596', '5594', '1026', '332', '11651', '355'], 'TranslationSet': [{'From': 'puma', 'To': '"Puma concolor"[Organism] OR "Puma"[Organism] OR puma[All Fields]'}], 'TranslationStack': [{'Term': '"Puma concolor"[Organism]', 'Field': 'Organism', 'Count': '23959', 'Explode': 'Y'}, {'Term': '"Puma"[Organism]', 'Field': 'Organism', 'Count': '23996', 'Explode': 'Y'}, 'OR', {'Term': 'puma[All Fields]', 'Field': 'All Fields', 'Count': '24385', 'Explode': 'N'}, 'OR', 'GROUP'], 'QueryTranslation': '"Puma concolor"[Organism] OR "Puma"[Organism] OR puma[All Fields]'}
['7157', '7157', '7157']

You can see that the rest of the properties are different, but the very first element in IdList is always 7157.

How do I get around this problem?

Thank you very much.

gene biopython entrez • 951 views

updated 4.5 years ago by massa.kassa.sc3na ▴ 600 • written 4.5 years ago by abel ▴ 10

score 0 · Answer 1 · 2019-10-16

Hi, this is not a Biopython problem.

It is a property of Entrez, which (as far as I know) returns hits in the order it finds them rather than ranked by how well they match the query.

What you need to do is sort them. I would suggest sorting by "relevance".

To do that, pass the sort argument to esearch. With sort="relevance", your code would look like this:

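from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder; NCBI asks for a contact address

idlist = []
terms = ["egfr", "pi3k", "puma"]
for termd in terms:
    print("search for ", termd)
    # sort="relevance" ranks the hits by how well they match the term
    info2 = Entrez.esearch(db="gene", term=termd, sort="relevance")
    record2 = Entrez.read(info2)
    info2.close()
    print(record2)
    # keep only the top-ranked hit for each term
    idlist.append(record2["IdList"][0])
print(idlist)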
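
If you want something a bit more defensive, the same loop can be wrapped in a small helper that does not crash when a term has no hits. This is only a sketch: first_id and search_top_hits are names I made up, the email address is a placeholder, and Biopython is imported inside the function so that first_id can be tried on its own without network access.

```python
def first_id(record):
    """Return the first ID of a parsed esearch record, or None if there are no hits."""
    ids = record.get("IdList", [])
    return ids[0] if ids else None


def search_top_hits(terms, db="gene", sort="relevance"):
    """Run one esearch per term and collect the top-ranked ID for each term."""
    from Bio import Entrez  # imported here so first_id() works without Biopython

    Entrez.email = "you@example.com"  # placeholder; replace with your own address
    top_hits = {}
    for term in terms:
        handle = Entrez.esearch(db=db, term=term, sort=sort)
        record = Entrez.read(handle)
        handle.close()
        top_hits[term] = first_id(record)
    return top_hits
```

Calling search_top_hits(["egfr", "pi3k", "puma"]) should then give a dict that maps each term to its best-ranked gene ID, or to None when nothing was found.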

For the gene database, the following sort options are available (they differ from database to database):

Chromosome
Gene Weight
Name
Relevance
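
To compare orderings quickly, it can help to look at the raw request, since the sort value is sent to NCBI as an ordinary query parameter of the esearch E-utility. The sketch below only builds the URL with the standard library; build_esearch_url is a hypothetical helper name, and apart from "relevance" I have not confirmed the exact strings the API accepts for the options listed above.

```python
from urllib.parse import urlencode

# Base address of the NCBI esearch E-utility
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="gene", sort="relevance", retmax=20):
    """Build an esearch URL; the sort option travels as a plain query parameter."""
    params = {"db": db, "term": term, "sort": sort, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


print(build_esearch_url("egfr"))
```

Opening the printed URL in a browser shows the XML that Entrez.read parses, so you can change the sort value by hand and see how the order of the IDs changes.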