Question

Getting species names and taxa id from assembly accession number

0

Entering edit mode

10 months ago

kwhiggins27 ▴ 10

Hi folks!

I have a list of NCBI assembly accession numbers, and I'm trying to return species name (latin and common) plus taxid. A few weeks ago, I wrote a bit of code that appeared to do the trick.

#!/usr/bin/env python
import csv
from Bio import Entrez

Entrez.email = xxx'

def get_organism_taxonomy(accession):
    handle = Entrez.efetch(db='assembly', id=accession, rettype='xml')
    record = Entrez.read(handle)
    organism = record['DocumentSummarySet']['DocumentSummary'][0]['Organism']
    taxonomy = record['DocumentSummarySet']['DocumentSummary'][0]['Taxid']
    return organism, taxonomy

def update_csv(input_file, output_file):
    with open(input_file, 'r') as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)  # Read the header row
        header += ['Organism', 'Taxonomy']  # Add new column headers

        rows = []
        for row in reader:
            accession = row[0]  # Assuming accession numbers are in the first column
            print(accession)
            organism, taxonomy = get_organism_taxonomy(accession)
            row += [organism, taxonomy]  # Append organism and taxonomy to the row
            rows.append(row)

    with open(output_file, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(header)
        writer.writerows(rows)
 nput_csv = '/lab/wengpj01/genomes/accessions_pt2.csv'
output_csv = '/lab/wengpj01/genomes/all_genomes_taxa_pt2.csv'
update_csv(input_csv, output_csv)

Now I'm returning to the same code, and it doesn't appear to be working anymore. I keep getting an error that says:

Traceback (most recent call last):   File "./get_info.py", line 9, in <module>
    handle=Entrez.efetch(db='assembly', id='GCA_000001405.29',rettype='xml')   File "/usr/local/lib/python3.8/dist-packages/Bio/Entrez/__init__.py", line 207, in efetch
    return _open(cgi, variables, post=post)   File "/usr/local/lib/python3.8/dist-packages/Bio/Entrez/__init__.py", line 606, in _open
    handle = urlopen(cgi)   File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)   File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)   File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(   File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)   File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)   File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp) urllib.error.HTTPError: HTTP Error 400: Bad Request

This error happens anytime I set rettype="xml" even though I swear this worked 10 days ago. The only thing that I know I've changed is that I got an api key. I've tried setting that in the code, but it doesn't seem to help.

What's weird is that when I try other rettypes, I get trivial responses. For instance:

handle=Entrez.efetch(db='assembly', id='GCA_000001405.29')
record=Entrez.read(handle)
print(record)

Returns only:

['1405', '29']

rather than a full xml about the human genome.

Any advice about how I could get this working would be greatly appreciated! Thanks so much!

efetch biopython entrez • 1.1k views

ADD COMMENT • link updated 10 months ago by MirianT_NCBI ▴ 720 • written 10 months ago by kwhiggins27 ▴ 10

2

Entering edit mode

There must be some temporary glitch.

Following command works (via EntrezDirect):

$ esearch -db assembly -query GCA_000001405.29 | esummary | xtract -pattern DocumentSummary -element Taxid,SpeciesName
9606    Homo sapiens

ADD REPLY • link 10 months ago by GenoMax 141k

0

Entering edit mode

You should probably not advertise your personal email. It is not needed to understand the code.

I don't know what exactly is the problem with your attempts, but I do know that many servers temporarily fail to operate properly during the weekends. You know, maintenance and such things. If the script was working before, I'd try it on a working day before concluding that something is wrong.

ADD REPLY • link 10 months ago by Mensur Dlakic ★ 27k

score 2 · Answer 1 · 2023-06-05

Hi,
As an alternative, you could use NCBI Datasets for this task. NCBI Datasets allows users to retrieve metadata reports in JSON or JSON-Lines format, and those can be converted to tsv using dataformat, a datasets companion tool. If you install datasets using conda, both tools are included.

Assuming you have a list of accessions (one per line, here genomes.list), you can use the summary option in datasets, combined with `dataformat.

datasets summary genome accession --inputfile genomes.list --as-json-lines | \
dataformat tsv genome --fields accession,organism-common-name,organism-name,organism-tax-id


Assembly Accession      Organism Common Name    Organism Name   Organism Taxonomic ID
GCA_005887515.3         domestic yak    Bos grunniens   30521
GCA_000002285.4         dog     Canis lupus familiaris  9615
GCF_018350175.1         domestic cat    Felis catus     9685
GCF_000001405.40        human   Homo sapiens    9606
GCF_000001635.27        house mouse     Mus musculus    10090

For a list of dataformat genome fields, you can refer to this documentation page: dataformat genome tsv .

Please feel free to reach out if you have any other questions.