Question

Unable to extract country name from genbank file


2.9 years ago

Akbar • 0

I want to extract the "Country of isolation" from GenBank records. I tried to run the following code in Google Colab. The file accessions.txt contains the accession numbers, i.e. ['GCA_001719305.1', 'GCA_903231415.1', 'GCA_903231425.1', 'GCA_903231445.1', 'GCA_903231465.1', 'GCA_903231475.1', 'GCA_903231515.1']

from Bio import Entrez

# Read the accessions from a file
accessions_file = 'accessions.txt'
with open(accessions_file) as f:
    ids = f.read().split('\n')

# Fetch the entries from Entrez
Entrez.email = 'name@example.org'  # Insert your email here
handle = Entrez.efetch('nuccore', id=ids, retmode='xml')
response = Entrez.read(handle)

# Parse the entries to get the country
def extract_countries(entry):
    sources = [feature for feature in entry['GBSeq_feature-table']
               if feature['GBFeature_key'] == 'source']

    for source in sources:
        qualifiers = [qual for qual in source['GBFeature_quals']
                      if qual['GBQualifier_name'] == 'country']

        for qualifier in qualifiers:
            yield qualifier['GBQualifier_value']

for entry in response:
    accession = entry['GBSeq_primary-accession']
    for country in extract_countries(entry):
        print(accession, country, sep=',')

I am getting the following error. Please help me resolve it. Thanks in advance.

HTTPError                                 Traceback (most recent call last)
<ipython-input-17-4518f5766224> in <module>()
      1 Entrez.email = 'pryp88@gmail.com'
----> 2 handle = Entrez.efetch('nuccore', id=ids, retmode='xml')
      3 response = Entrez.read(handle)

7 frames
/usr/lib/python3.7/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650 
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 400: Bad Request

Genbank data extraction • 1.3k views

ADD COMMENT • link updated 2.9 years ago by vkkodali_ncbi ★ 3.7k • written 2.9 years ago by Akbar • 0


Your code is not reaching the extraction part; it fails at the Entrez query itself. Can you run Python in an interactive session with a single ID to check that your code works up to the response = line?

ADD REPLY • link 2.9 years ago by Ram 43k


The code indentation seems incorrect. This could be a formatting issue in the post or a genuine problem in the code.

Try with a single example:

from Bio import Entrez

Entrez.email = 'name@example.org'  # Insert your email here
handle = Entrez.efetch('nuccore', id="GCA_001719305.1", retmode='xml')
response = Entrez.read(handle)
print(response)

If the indentation is correct and the code is exactly as you posted, I suspect the problem may be that ids is a list, which efetch might not be able to handle.

Are you behind a proxy?

ADD REPLY • link 2.9 years ago by pbpanigrahi ▴ 420

score 0 · Answer 1 · 2021-06-05

The GCA accessions are genome assembly accessions (see here for a brief explanation). You cannot query the nuccore database with these identifiers and download sequence data.

If your starting point is a list of assembly accessions and you would like to find the geographic location of each sample, first use epost to load the identifiers into the history, then use elink with the target biosample to get the linked BioSample records, and finally download those records using efetch. At that point you will have a BioSample record containing the geographic location where the sample was isolated.

On the Unix command-line, using Entrez Direct, this process translates to the following command:

$ echo 'GCA_001719305.1' | epost -db assembly | elink -db assembly -target biosample | efetch 
1: Pathogen: clinical or host-associated sample from Burkholderia cenocepacia
Identifiers: BioSample: SAMN05301538; Sample name: Burkho_UC; SRA: SRS1539607
Organism: Burkholderia cenocepacia
Attributes:
    /strain="C141BCUC"
    /collected by="Laboratorio de Microbiologia de la Red de Salud UC-Christus"
    /collection date="2015-04"
    /geographic location="Chile: Santiago"
    /host="Homo sapiens"
    /host disease="Cystic Fibrosis"
    /isolation source="Expectoration sample"
    /latitude and longitude="33.44342168310788 S 70.64062669873238 W"
    /host description="Pediatric pacient"
    /host disease outcome="Death"
    /host sex="female"
Accession: SAMN05301538 ID: 5301538
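
For anyone who wants to stay in Python, the same chain can be sketched with Biopython roughly as follows. This is an untested-against-NCBI sketch, not an official recipe: it uses esearch rather than epost to turn the assembly accession into a UID (an assumption on my part, since Biopython's elink expects UIDs), and the helper names linked_ids and fetch_biosample_xml are made up for illustration.

```python
def linked_ids(linksets):
    """Collect the target UIDs from a parsed elink result (a list of link sets)."""
    return [link["Id"]
            for linkset in linksets
            for linkdb in linkset.get("LinkSetDb", [])
            for link in linkdb["Link"]]


def fetch_biosample_xml(assembly_accession, email):
    """Return raw BioSample XML for one assembly accession (needs network access)."""
    # Biopython is imported here so that linked_ids can be used without it.
    from Bio import Entrez

    Entrez.email = email

    # Step 1: resolve the accession to an assembly UID
    # (assumption: esearch stands in for the epost step of the shell command).
    handle = Entrez.esearch(db="assembly", term=assembly_accession)
    uids = Entrez.read(handle)["IdList"]
    handle.close()
    if not uids:
        raise ValueError(f"No assembly record found for {assembly_accession}")

    # Step 2: follow the assembly -> biosample links.
    handle = Entrez.elink(dbfrom="assembly", db="biosample", id=uids)
    sample_ids = linked_ids(Entrez.read(handle))
    handle.close()
    if not sample_ids:
        raise ValueError(f"No linked BioSample for {assembly_accession}")

    # Step 3: download the linked BioSample records as XML.
    handle = Entrez.efetch(db="biosample", id=sample_ids, retmode="xml")
    xml_text = handle.read()
    handle.close()
    return xml_text
```

A call would look like fetch_biosample_xml('GCA_001719305.1', 'name@example.org'). Looping over the accessions one at a time keeps each result tied to its assembly, at the cost of three requests per accession.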

If you pipe the output of elink to esummary instead, XML is returned that you can parse using the Entrez Direct tool xtract (or any other XML parsing tool).
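
As an example of the "any other XML parsing tool" route, here is a minimal Python sketch using only the standard library. It assumes the BioSample XML layout in which each attribute is an Attribute element carrying an attribute_name or harmonized_name of geo_loc_name; check this against the XML you actually receive. The sample document below is hand-written for illustration, with values copied from the text output above, and geographic_locations is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample mimicking the assumed BioSample XML layout;
# the values are copied from the efetch text output shown earlier.
SAMPLE_XML = """<BioSampleSet>
  <BioSample accession="SAMN05301538">
    <Attributes>
      <Attribute attribute_name="strain" harmonized_name="strain">C141BCUC</Attribute>
      <Attribute attribute_name="geo_loc_name" harmonized_name="geo_loc_name">Chile: Santiago</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>"""


def geographic_locations(xml_text):
    """Yield (biosample_accession, location) pairs from BioSample XML."""
    root = ET.fromstring(xml_text)
    for sample in root.iter("BioSample"):
        for attr in sample.iter("Attribute"):
            names = (attr.get("harmonized_name"), attr.get("attribute_name"))
            if "geo_loc_name" in names:
                yield sample.get("accession"), (attr.text or "").strip()


for accession, location in geographic_locations(SAMPLE_XML):
    # The country is the part before the first colon, e.g. "Chile: Santiago".
    country = location.split(":")[0].strip()
    print(accession, country, sep=",")
```

Splitting on the colon works here because the location is written as "country: region"; records that give only a country have no colon and pass through unchanged.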