Question: biopython entrez lost connection?
flogin • 250 (Brazil) wrote, 4 weeks ago:

Hey guys,

I'm in the following situation:

I have a tab-delimited file like this:

aag2_ctg_1000   YP_009342179.1  neg 104655  105278
aag2_ctg_1000   YP_009666655.1  pos 257933  258349
aag2_ctg_1000   YP_073558.1 neg 15802   16581
aag2_ctg_1000   YP_073558.1 neg 61637   62266
aag2_ctg_1000   YP_073558.1 pos 43922   44227
aag2_ctg_1000   YP_073558.1 pos 334345  334659
aag2_ctg_1000   YP_073558.1 pos 481376  482059
aag2_ctg_1001   NP_043933.1 pos 123492  123662
aag2_ctg_1001   NP_077596.1 pos 422453  422671
aag2_ctg_1001   YP_001426837.1  pos 43143   43544

The first column holds a contig name and the second a protein ID from NCBI. All of these proteins are viral, and to retrieve the taxonomy of the virus behind each protein I wrote the following script:

Code modified after genomax's suggestions:

#!/usr/bin/python3
from Bio import Entrez
import csv
import re
import time

import numpy as np
import pandas as pd

Entrez.email = '***@***'  # registered e-mail (redacted)
Entrez.api_key = '*****************************************************'  # NCBI API key (redacted)

# input_file is assumed to be defined earlier as the dataset prefix
error_tax = []
prot_id = []
with open(input_file + '.fmt.blastx.filtred.annot.bed.merged.tax.tmp', 'r') as tax_out_tmp, \
     open(input_file + '.fmt.blastx.filtred.annot.bed.merged.tax.tmp2', 'w+', newline='') as tax_out_tmp2:
    tax_tmp_reader = csv.reader(tax_out_tmp, delimiter='\t')
    tax_tmp2_writer = csv.writer(tax_out_tmp2)
    for line in tax_tmp_reader:
        prot_id.append(line[1])
    prot_id = list(dict.fromkeys(prot_id))  # drop duplicate accessions, keep order
    count = 1
    tax_reference_list = []
    tax_levels_list = []
    for var_prot_id in prot_id:
        # fetch the GenBank record and pull the taxon ID out of its db_xref line
        handle_code = Entrez.efetch(db='protein', id=var_prot_id, rettype='gb')
        handle_code_var = str(re.findall(r'\/db_xref="taxon:.*', handle_code.read()))
        handle_code_var = re.sub(r"\[.*:", '', handle_code_var)
        handle_code_var = re.sub(r'".*\]', '', handle_code_var)
        tax_reference_list.append([var_prot_id, handle_code_var])
        print(f'{count}/{len(prot_id)}')
        count += 1
        time.sleep(0.15)  # with an API key, up to 10 requests/second are allowed
    count = 1
    for i in tax_reference_list:
        sk = od = fm = gn = sp = ''
        prot_acc = i[0].rstrip('\n')
        tax_id = i[1].rstrip('\n')
        try:
            handle_tax_data = Entrez.efetch(db='taxonomy', id=tax_id, retmode='xml')
            record_tax_data = Entrez.read(handle_tax_data)
            for x in record_tax_data[0]['LineageEx']:
                if 'superkingdom' in x.values():
                    sk = x['ScientificName']
                if 'order' in x.values():
                    od = x['ScientificName']
                if 'family' in x.values():
                    fm = x['ScientificName']
                if 'genus' in x.values():
                    gn = x['ScientificName']
                if 'species' in x.values():
                    sp = x['ScientificName']
        except Exception:
            print(f'error with: {tax_id}')
            continue
        tax_levels_list.append([prot_acc, tax_id, sk, od, fm, gn, sp])
        print(f'{count}/{len(tax_reference_list)}')
        count += 1
        time.sleep(0.15)
    tax_tmp2_writer.writerows(tax_levels_list)

# load the taxonomy table once into a dict instead of re-opening the file
# for every BLAST row
tax_by_prot = {}
with open(input_file + '.fmt.blastx.filtred.annot.bed.merged.tax.tmp2', 'r') as tax_out_tmp2:
    for row_tax in csv.reader(tax_out_tmp2, delimiter=','):
        tax_by_prot[row_tax[0]] = row_tax
tax_complete_list = []
with open(input_file + '.fmt.blastx.filtred.annot.bed.merged.tax.tmp', 'r') as tax_out_tmp:
    for row_prot in csv.reader(tax_out_tmp, delimiter='\t'):
        row_tax = tax_by_prot.get(row_prot[1])
        if row_tax:
            tax_complete_list.append([row_prot[0] + ':' + row_prot[3] + '-' + row_prot[4],
                                      row_prot[1], row_tax[1], row_tax[2], row_tax[3],
                                      row_tax[4], row_tax[5], row_tax[6],
                                      row_prot[2], row_prot[3], row_prot[4]])
df_tax = pd.DataFrame(tax_complete_list, columns=["Element-ID", "Protein-ID", "Virus-ID", "Super-Kingdom", "Order", "Family", "Genus", "Species", "Sense", "Start", "End"])
df_tax = df_tax.replace(r'^\s*$', np.nan, regex=True)
df_tax.to_csv(input_file + '.fmt.blastx.filtred.annot.bed.merged.tax.csv', sep=',', index=False, na_rep='NULL')

For small datasets (< 10k lines) this script works perfectly, but for larger ones (40k, 80k lines) it simply hangs at the step that sends a protein ID to the NCBI protein database. I can tell because a protein ID is printed but the script never moves on, e.g.:

Protein ID: YP_009465730.1
DONE:2079134
Protein ID: YP_009507247.1
DONE:2083181
Protein ID: YP_009507248.1
DONE:2083181
Protein ID: YP_009507248.1

No errors are reported; the script just sits on the same protein for hours. I suspect some problem with the connection to NCBI, but I have no idea how to solve it. Can anyone help?

Thanks !

biopython taxonomy python ncbi • 127 views
modified 21 days ago • written 4 weeks ago by flogin • 250
So if I understand correctly, your code works fine for a small subset of IDs but not for a larger chunk (lines), which means the code itself should be technically okay. Your code requests information from a server (here NCBI). I do not know how Biopython does it internally, but these are essentially API requests (just wrapped by Biopython). Servers usually limit how many requests can be made per minute or hour from a given IP (of your machine), and I am not sure what NCBI sets its limit to. My guess is that you do not cross this threshold for 10k lines, but do so for 40k and above.

What you could try, in my opinion, is the following (a dirty hack): since you know 10k lines are not a problem, send requests to the server in batches of 10k and put your code to sleep after every batch; you will have to determine the sleep time by trial and error. Sleeping means your code makes no requests to NCBI for that time and hopefully does not run over the hourly(?) threshold. I had the same problem making requests to Ensembl, and the sleep trick works nicely for me.
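The batch-and-sleep idea can be sketched like this (a minimal illustration; `fetch_one`, the batch size, and the pause are placeholders you would tune by trial and error):

```python
import time

def batched(items, size):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch_in_batches(ids, fetch_one, batch_size=10000, pause=60):
    """Process IDs in batches, sleeping between batches to stay
    under the server's request limit (tune `pause` empirically)."""
    results = []
    for batch in batched(ids, batch_size):
        results.extend(fetch_one(i) for i in batch)
        time.sleep(pause)  # back off before the next batch
    return results
```

Here `fetch_one` would be whatever single-ID Entrez call your script already makes.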

written 4 weeks ago by manaswwm • 110

Nice, I'm gonna try that.

written 4 weeks ago by flogin • 250

Have you signed up for an NCBI API key, and are you using it? If not, you should do that first. When you are doing massive queries like this you need to build in some delay to avoid overloading a public resource.

written 4 weeks ago by genomax • 85k

I use Entrez.email and put a sleep of 0.4 seconds between queries; should the delay be longer than that?

written 4 weeks ago by flogin • 250

You should sign up for an NCBI API key. See the Frequency/Timing section of the EntrezDirect page for the query limits.

I am sure there is a taxdump file that associates accession numbers with taxIDs; I will look for it later. It would be ideal to parse that file rather than use EUtils in this case.

written 4 weeks ago by genomax • 85k

Right, thanks genomax. I created an API key and put it in my script with Entrez.api_key, so I'm going to test now whether everything runs normally.

written 4 weeks ago by flogin • 250

While this answer does not address the problem described in the original post (that has been dealt with in the comments above), this is the preferred way to achieve the same end, especially with a large number of proteins involved.

Parsing this protein-accession-to-taxonomy mapping file (note: 5.6 GB download) locally would be much faster than doing EntrezDirect queries.

$ zgrep YP_009465730 prot.accession2taxid.gz 
YP_009465730    YP_009465730.1  2079134 1347731956
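A minimal Python sketch of the local lookup (column order per the line above: accession, accession.version, taxid, gi, tab-separated with one header line; the function name and early-exit logic are my own):

```python
import csv
import gzip

def load_acc2taxid(path, wanted):
    """Stream prot.accession2taxid.gz and keep only the accessions we need.
    `wanted` is a set of versioned accessions like 'YP_009465730.1'."""
    mapping = {}
    with gzip.open(path, 'rt') as fh:
        reader = csv.reader(fh, delimiter='\t')
        next(reader)  # skip the header line
        for acc, acc_ver, taxid, _gi in reader:
            if acc_ver in wanted:
                mapping[acc_ver] = taxid
                if len(mapping) == len(wanted):
                    break  # found everything; stop reading early
    return mapping
```

Streaming the gzip line by line keeps memory flat even though the full file is several gigabytes.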
written 4 weeks ago by genomax • 85k

Thanks for your suggestions genomax. Doing the search locally can indeed be a good choice; I'm going to test it. Moreover, your suggestion made me think about reducing the number of online queries: I was sending every protein ID from the BLAST output (around 72k IDs), but after removing redundant IDs only 1644 remain. That cuts the online search down a lot. I now build a table of taxonomy per unique protein ID and afterwards cross it with the BLAST output; this quick fix avoids the huge online search.
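That dedup-then-cross idea can be sketched with pandas (the column names and species values here are hypothetical, just to show the shape of the join):

```python
import pandas as pd

# Toy BLAST-derived table: many hits can share one protein accession.
blast = pd.DataFrame({
    'Contig': ['aag2_ctg_1000', 'aag2_ctg_1000', 'aag2_ctg_1001'],
    'Protein-ID': ['YP_073558.1', 'YP_073558.1', 'NP_043933.1'],
})

# Query each unique accession only once (72k hits -> ~1.6k lookups here).
unique_ids = blast['Protein-ID'].drop_duplicates().tolist()

# Taxonomy table built from those one-per-ID lookups (values are made up).
tax = pd.DataFrame({
    'Protein-ID': unique_ids,
    'Species': ['Example virus A', 'Example virus B'],
})

# Cross the taxonomy back onto every BLAST hit.
merged = blast.merge(tax, on='Protein-ID', how='left')
```

A left join keeps every BLAST row even when a lookup failed, leaving NaN to fill later.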

modified 4 weeks ago • written 4 weeks ago by flogin • 250

I remove the redundancy of protein IDs, rest only 1644 IDs, in this way I reduce a lot of the online search

Excellent.

written 29 days ago by genomax • 85k

I managed to complete the search, but sometimes there is still a bug: the script just hangs when sending one protein ID or taxonomy ID. I set the NCBI e-mail and API key, and put time.sleep(0.15) after each search (because with an API key we can send 10 queries per second). I'll update the code in the original post. Maybe there is some Python library that can be used to avoid this problem?
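One way to stop a single request from hanging forever is a socket timeout plus a retry wrapper; this is a sketch under my own assumptions (the 30 s timeout and retry counts are arbitrary, and `with_retry` is a hypothetical helper, not a Biopython API). Biopython's Entrez module also exposes `Entrez.max_tries` and `Entrez.sleep_between_tries` for its own internal retries.

```python
import socket
import time

# A stalled HTTP connection now raises socket.timeout instead of blocking forever.
socket.setdefaulttimeout(30)

def with_retry(fetch, tries=5, wait=15):
    """Call `fetch()` (e.g. a lambda wrapping Entrez.efetch); on a timeout
    or connection error, wait and retry instead of hanging the whole run."""
    for attempt in range(tries):
        try:
            return fetch()
        except (socket.timeout, OSError):
            if attempt == tries - 1:
                raise  # out of retries, surface the error
            time.sleep(wait)
```

Usage would look like `handle = with_retry(lambda: Entrez.efetch(db='protein', id=pid, rettype='gb'))`.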

written 21 days ago by flogin • 250

I still think you should do the search locally with the file I linked above.

written 21 days ago by genomax • 85k
Powered by Biostar version 2.3.0