urllib.error.HTTPError: HTTP Error 400: Bad Request with biopython Entrez

Hi! I have an issue with some requests I'm making with Entrez. I'm trying to fetch information for BioProject IDs from the bioproject database, but it seems to work only when it wants to. Inside the for loop I get:

Entrez.efetch(db="bioproject", retmode="xml", id=bio_id)305084
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 8, in get_publication_infos
  File "/Users/cea/miniconda3/lib/python3.10/site-packages/Bio/Entrez/__init__.py", line 196, in efetch
    return _open(request)
  File "/Users/cea/miniconda3/lib/python3.10/site-packages/Bio/Entrez/__init__.py", line 586, in _open
    handle = urlopen(request)
  File "/Users/cea/miniconda3/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/cea/miniconda3/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/Users/cea/miniconda3/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/Users/cea/miniconda3/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/Users/cea/miniconda3/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/Users/cea/miniconda3/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

and outside the for loop, for the same bio_id:

handle = Entrez.efetch(db="bioproject", retmode="xml", id=305084)
bio_file = handle.read()
soup = BS.BeautifulSoup(bio_file, 'xml')

and I can print soup with no problem. I guess there is a problem with the way I'm managing the API requests? If someone has any idea, I'd appreciate it.

This is my full code:

from Bio import Entrez
import bs4 as BS
import lxml
import ipdb

Entrez.email = "XXXXX"
Entrez.api_key ="XXXXXX"

def get_ids_bioproject(IDs):
    #return a list of all bioproject ids related to assemblies
    bio_ids = set()
    dict_bioproject_assembly = dict()
    for ID in IDs:
        esummary_handle = Entrez.esummary(db="assembly", id=ID, report="full")
        esummary_record = Entrez.read(esummary_handle)
        bio_id = esummary_record['DocumentSummarySet']['DocumentSummary'][0]['GB_BioProjects'][0]['BioprojectId']
        #dict_bioproject_assembly[ID] = bio_id
        bio_ids.add(bio_id)
    return(dict_bioproject_assembly, bio_ids)


def get_publication_infos(bioproject_ids):
    #return a dict with information about publication related to assembly through bioproject id
    dict_info_journal = dict()
    list_odd_blank_assembly = list()
    for bio_id in bioproject_ids:
        print(bio_id)
        print('Entrez.efetch(db="bioproject", retmode="xml", id=bio_id)' +bio_id)
        handle = Entrez.efetch(db="bioproject", retmode="xml", id=bio_id)
        bio_file = handle.read()
        soup = BS.BeautifulSoup(bio_file, 'xml')
        print(soup)
        if soup.find_all('Publication') == list():
            list_odd_blank_assembly.append(bio_id)
            continue
        else:
            if len(soup.find_all('Title')) < 2:
                list_odd_blank_assembly.append(bio_id)
                continue
            else:
                dict_info_journal[bio_id] = dict()
                dict_info_journal[bio_id]['Title'] = soup.find_all('Title')[1].string
                dict_info_journal[bio_id]['Journal'] = soup.JournalTitle.string
                dict_info_journal[bio_id]['Author'] = soup.Last.string+" et al."
                dict_info_journal[bio_id]['Year'] = soup.Year.string
                dict_info_journal[bio_id]['Pubmed'] =  "https://pubmed.ncbi.nlm.nih.gov/"+soup.find("Publication")['id']
        handle.close()
    return(dict_info_journal,list_odd_blank_assembly)

query = "Microbacterium[Organism] AND latest_refseq[filter] NOT partial[filter]"
handle = Entrez.esearch(term=query, db="Assembly", retmax=900)
IDs = Entrez.read(handle)["IdList"]

(dict_bioproject_assembly, bioproject_ids) = get_ids_bioproject(IDs)
(dict_info_journal, list_odd_blank_assembly) = get_publication_infos(bioproject_ids)

Thanks for your help!

(I'm using Python 3.10.9 and Biopython 1.80.)

but it seems to work only when it wants to.

NCBI is a public resource, and when running large numbers of queries against it, please add a pause/sleep step to your code. I see that you are using an NCBI API key, but that also has a limit on the number of queries per unit of time.
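Something along these lines in your fetch loop would do it; this is only a sketch, and the 0.35 s pause is an assumption based on NCBI's documented limits (3 requests/second without a key, 10/second with one):

import time

for bio_id in bioproject_ids:
    handle = Entrez.efetch(db="bioproject", retmode="xml", id=bio_id)
    bio_file = handle.read()
    handle.close()
    # ... parse bio_file as before ...
    time.sleep(0.35)  # stay under NCBI's per-second request limit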

That ID does not seem to have any publication associated with it.

$ esearch -db bioproject -query 305084 | elink -target pubmed
<ENTREZ_DIRECT>
  <Db>pubmed</Db>
  <WebEnv>MCID_63ea2eadf92e4e51131407d1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>0</Count>
  <Step>2</Step>
</ENTREZ_DIRECT>
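The same check can be done from Biopython with Entrez.elink; a rough equivalent (assuming Entrez.email is already set) would be:

from Bio import Entrez

handle = Entrez.elink(dbfrom="bioproject", db="pubmed", id="305084")
record = Entrez.read(handle)
handle.close()
# LinkSetDb is empty when no PubMed records are linked to the project
linked = [link["Id"]
          for linkset in record[0].get("LinkSetDb", [])
          for link in linkset["Link"]]
print(linked)  # prints [] for this BioProject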

Yeah, but my request is against the bioproject db, not the assembly db; I guess that's why. As you can see in my example, I'm able to find it outside of my loop. Can I just use the sleep() function?


OK, I have corrected the search database above. There appears to be no publication associated with this ID. I understand that NCBI will add a publication (in case submitters do not do it themselves) if it finds one for a particular SRA accession, but this process is likely not perfect.

Use some way to put a delay in between your queries. sleep() would be fine.


Well, I'm not saying I find a paper for this request, just that I don't get error 400. This is my output from a different terminal:

>>> handle = Entrez.efetch(db="bioproject", retmode="xml", id=305084)
>>> bio_file = handle.read()
>>> soup = BS.BeautifulSoup(bio_file, 'xml')
>>> soup
<?xml version="1.0" encoding="utf-8"?>
<RecordSet><DocumentSummary uid="305084">
<Project>
<ProjectID>
<ArchiveID accession="PRJNA305084" archive="NCBI" id="305084"/>
<LocalID>bp0</LocalID>
<LocalID>bp0</LocalID>
</ProjectID>
<ProjectDescr>
<Name>Bacteria</Name>
<Title>Great Artesian Basin gas bore wells Genome sequencing and assembly</Title>
<Description>In an effort to discover novel bacterial species or novel ecotypes with potential bioremediation or biotechnological applications, bacteria were isolated from water/sediment samples taken from gas producing bore wells, some involved in coal seam gas (CSG) extraction activities. Genome sequencing of isolates using the WGS sequencing approach was conducted in an effort to elucidate functional profiles of novel isolates or identify novel functional properties of new ecotypes. This genome sequencing project is part of a larger research effort to better understand the involvement of microbial communities in the production of gas from these bore wells.</Description>
<ProjectReleaseDate>2016-02-01T00:00:00Z</ProjectReleaseDate>
<Relevance>
<Environmental>yes</Environmental>
</Relevance>
<LocusTagPrefix assembly_id="GCA_001560855" biosample_id="SAMN04324283">AT864</LocusTagPrefix>
<LocusTagPrefix assembly_id="GCA_001543375" biosample_id="SAMN04356940">AU359</LocusTagPrefix>
<LocusTagPrefix assembly_id="GCA_001543455" biosample_id="SAMN04357041">AU374</LocusTagPrefix>
<LocusTagPrefix assembly_id="GCA_001617335" biosample_id="SAMN04357042">AU375</LocusTagPrefix>
<LocusTagPrefix assembly_id="GCA_001620065" biosample_id="SAMN04378105">AVP41</LocusTagPrefix>
<LocusTagPrefix assembly_id="GCA_001620055" biosample_id="SAMN04378109">AVP42</LocusTagPrefix>
<LocusTagPrefix assembly_id="GCA_001620045" biosample_id="SAMN04378115">AVP43</LocusTagPrefix>
</ProjectDescr>
<ProjectType>
<ProjectTypeSubmission>
<Target capture="eWhole" material="eGenome" sample_scope="eMultispecies">
<Organism species="-1" taxID="2">
<OrganismName>Bacteria</OrganismName>
<Supergroup>eBacteria</Supergroup>
<BiologicalProperties>
<Environment>
<OptimumTemperature>C</OptimumTemperature>
<TemperatureRange>eMesophilic</TemperatureRange>
<Habitat>eAquatic</Habitat>
</Environment>
</BiologicalProperties>
</Organism>
<Provider>Bharat K. C. Patel</Provider>
<Description>Water sampled from gas producing bore wells of the Surat Basin (GAB) of southern Queensland including bores and produced water treatment ponds from coal seam gas (CSG) extraction plant in the same region.</Description>
</Target>
<Method method_type="eSequencing"/>
<Objectives>
<Data data_type="eSequence"/>
<Data data_type="eAssembly"/>
</Objectives>
<IntendedDataTypeSet>
<DataType>genome sequencing and assembly</DataType>
</IntendedDataTypeSet>
<ProjectDataTypeSet>
<DataType>Genome sequencing and assembly</DataType>
</ProjectDataTypeSet>
</ProjectTypeSubmission>
</ProjectType>
</Project>
<Submission last_update="2015-12-04" submission_id="SUB1218543" submitted="2015-12-04">
<Description>
<!-- Submitter information has been removed -->
<Organization role="owner" type="institute" url="https://www.griffith.edu.au/">
<Name>Griffith University</Name>
<!-- Contact information has been removed -->
</Organization>
<Access>public</Access>
</Description>
<Action action_id="SUB1218543-bp0"/>
</Submission>
</DocumentSummary>
</RecordSet>
>>> 

and after parsing that I can confirm that, yes, there is no publication. My question is: why does it work in one terminal while in the other I get Error 400? I tried time.sleep(5) after each request, and the result is the same :(


Mysteries of the internet. In general, Entrez Direct does not seem to have a good way of handling errors when they occur during piping operations.
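On the Biopython side you can at least catch the error and retry; a minimal sketch (safe_efetch is just an illustrative name, and the back-off values are assumptions):

import time
import urllib.error
from Bio import Entrez

def safe_efetch(max_tries=3, **kwargs):
    # Illustrative helper: retry efetch a few times when NCBI returns a transient error.
    for attempt in range(1, max_tries + 1):
        try:
            return Entrez.efetch(**kwargs)
        except urllib.error.HTTPError:
            if attempt == max_tries:
                raise
            time.sleep(2 ** attempt)  # back off before retrying

# usage: handle = safe_efetch(db="bioproject", retmode="xml", id=bio_id)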
