Unable to extract country name from genbank file
4 months ago
Akbar • 0

I want to extract the "Country of isolation" from GenBank records. I tried to run the following code in Google Colab. accessions.txt contains accession numbers, e.g. ['GCA_001719305.1', 'GCA_903231415.1', 'GCA_903231425.1', 'GCA_903231445.1', 'GCA_903231465.1', 'GCA_903231475.1', 'GCA_903231515.1']

from Bio import Entrez

# Read the accessions from a file
accessions_file = 'accessions.txt'
with open(accessions_file) as f:
    ids = f.read().split('\n')

# Fetch the entries from Entrez
Entrez.email = 'name@example.org'  # Insert your email here
handle = Entrez.efetch('nuccore', id=ids, retmode='xml')
response = Entrez.read(handle)

# Parse the entries to get the country
def extract_countries(entry):
    sources = [feature for feature in entry['GBSeq_feature-table']
               if feature['GBFeature_key'] == 'source']

    for source in sources:
        qualifiers = [qual for qual in source['GBFeature_quals']
                      if qual['GBQualifier_name'] == 'country']

        for qualifier in qualifiers:
            yield qualifier['GBQualifier_value']

for entry in response:
    accession = entry['GBSeq_primary-accession']
    for country in extract_countries(entry):
        print(accession, country, sep=',')

I get the following error. Please help me resolve it. Thanks in advance.

HTTPError                                 Traceback (most recent call last)
<ipython-input-17-4518f5766224> in <module>()
      1 Entrez.email = 'pryp88@gmail.com'
----> 2 handle = Entrez.efetch('nuccore', id=ids, retmode='xml')
      3 response = Entrez.read(handle)

7 frames
/usr/lib/python3.7/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 400: Bad Request

Your code never reaches the extraction part; it fails at the Entrez query itself. Can you run Python in an interactive session with a single ID to check that your code works up to the response = line?


The code indentation seems incorrect. This could be a formatting issue in the post or a genuine problem in the code.

Try with one example.

from Bio import Entrez

Entrez.email = 'name@example.org'  # Insert your email here
handle = Entrez.efetch('nuccore', id="GCA_001719305.1", retmode='xml')
response = Entrez.read(handle)

If the indentation is correct and the code looks as you posted it, I suspect that ids is a list, which efetch may not be able to handle.
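To help rule the list out: as far as I know, Biopython joins a list of IDs with commas before sending the request, so a list by itself should be accepted. What can go wrong is that f.read().split('\n') leaves an empty string in the list when the file ends with a newline. A minimal sketch of a cleaner way to build the list (parse_accessions is a hypothetical helper name, and the two accessions are taken from the question):

```python
def parse_accessions(text):
    """Return the non-empty, whitespace-stripped lines of an accession file."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# File contents with a trailing newline, as most editors write them
raw = 'GCA_001719305.1\nGCA_903231415.1\n'

naive_ids = raw.split('\n')        # the approach used in the question
clean_ids = parse_accessions(raw)  # blank entries removed

print(naive_ids)
print(clean_ids)
```

This only tidies the input; it does not by itself make these accessions valid for the nuccore database.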

Are you behind a proxy?

4 months ago
vkkodali ★ 2.7k

The GCA accessions are genome assembly accessions (see here for a brief explanation). You cannot query the nuccore database with these identifiers and download sequence data.

If your starting point is a list of assembly accessions and you would like to find the geographic location of each sample, you should first use epost to load the identifiers into the history, then use elink with target biosample to get the linked BioSample records, and finally download those records using efetch. At that point you will have a BioSample record containing the geographic location where the sample was isolated.

On the Unix command-line, using Entrez Direct, this process translates to the following command:

$ echo 'GCA_001719305.1' | epost -db assembly | elink -db assembly -target biosample | efetch 
1: Pathogen: clinical or host-associated sample from Burkholderia cenocepacia
Identifiers: BioSample: SAMN05301538; Sample name: Burkho_UC; SRA: SRS1539607
Organism: Burkholderia cenocepacia
    /collected by="Laboratorio de Microbiologia de la Red de Salud UC-Christus"
    /collection date="2015-04"
    /geographic location="Chile: Santiago"
    /host="Homo sapiens"
    /host disease="Cystic Fibrosis"
    /isolation source="Expectoration sample"
    /latitude and longitude="33.44342168310788 S 70.64062669873238 W"
    /host description="Pediatric pacient"
    /host disease outcome="Death"
    /host sex="female"
Accession: SAMN05301538 ID: 5301538

If you pipe the output of elink to esummary instead, XML is returned, which you can parse using the Entrez Direct tool xtract (or any other XML parsing tool).
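Since the question uses Biopython, the same assembly to BioSample chain could be sketched in Python roughly as follows. This is only a sketch under several assumptions: it uses esearch rather than epost to turn the accession into an assembly UID, the function names fetch_biosample_xml and geo_locations are made up for illustration, and I am assuming the BioSample XML stores the location in an Attribute element whose harmonized_name (or attribute_name) is geo_loc_name. The SAMPLE_XML string is a hand-trimmed illustration built from the values shown above, not real efetch output.

```python
import xml.etree.ElementTree as ET


def geo_locations(biosample_xml):
    """Yield (BioSample accession, location) pairs from BioSample XML.

    Assumes the location is stored in an <Attribute> element named
    'geo_loc_name' (assumption about the BioSample XML layout).
    """
    root = ET.fromstring(biosample_xml)
    for sample in root.iter('BioSample'):
        for attr in sample.iter('Attribute'):
            names = (attr.get('harmonized_name'), attr.get('attribute_name'))
            if 'geo_loc_name' in names:
                yield sample.get('accession'), (attr.text or '').strip()


def fetch_biosample_xml(assembly_accession, email):
    """Follow assembly -> biosample links and return the BioSample XML.

    Needs Biopython and network access. Imported here so that the
    parsing helper above can be used without Biopython installed.
    """
    from Bio import Entrez
    Entrez.email = email

    # Resolve the accession to an assembly UID (esearch stands in for epost)
    handle = Entrez.esearch(db='assembly', term=assembly_accession)
    uids = Entrez.read(handle)['IdList']
    handle.close()
    if not uids:
        raise ValueError('No assembly found for ' + assembly_accession)

    # Follow the links from the assembly record(s) to BioSample
    handle = Entrez.elink(dbfrom='assembly', db='biosample', id=uids)
    linksets = Entrez.read(handle)
    handle.close()
    sample_ids = [link['Id']
                  for linkset in linksets
                  for linkdb in linkset['LinkSetDb']
                  for link in linkdb['Link']]
    if not sample_ids:
        raise ValueError('No linked BioSample for ' + assembly_accession)

    # Download the BioSample records as XML
    handle = Entrez.efetch(db='biosample', id=sample_ids, retmode='xml')
    xml_data = handle.read()
    handle.close()
    return xml_data


# Hand-trimmed illustration of a BioSample record (not real efetch output)
SAMPLE_XML = """
<BioSampleSet>
  <BioSample accession="SAMN05301538">
    <Attributes>
      <Attribute attribute_name="host" harmonized_name="host">Homo sapiens</Attribute>
      <Attribute attribute_name="geo_loc_name" harmonized_name="geo_loc_name">Chile: Santiago</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>
"""

for accession, location in geo_locations(SAMPLE_XML):
    print(accession, location, sep=',')
```

With network access, something like geo_locations(fetch_biosample_xml('GCA_001719305.1', 'name@example.org')) would replace the SAMPLE_XML demo. The XML is parsed with ElementTree rather than Entrez.read because I am not sure Entrez.read can handle BioSample XML.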

