Getting species names and taxa id from assembly accession number
1
0
Entering edit mode
10 months ago
kwhiggins27 ▴ 10

Hi folks!

I have a list of NCBI assembly accession numbers, and I'm trying to return species name (latin and common) plus taxid. A few weeks ago, I wrote a bit of code that appeared to do the trick.

#!/usr/bin/env python
import csv
from Bio import Entrez

Entrez.email = xxx'

def get_organism_taxonomy(accession):
    handle = Entrez.efetch(db='assembly', id=accession, rettype='xml')
    record = Entrez.read(handle)
    organism = record['DocumentSummarySet']['DocumentSummary'][0]['Organism']
    taxonomy = record['DocumentSummarySet']['DocumentSummary'][0]['Taxid']
    return organism, taxonomy

def update_csv(input_file, output_file):
    with open(input_file, 'r') as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)  # Read the header row
        header += ['Organism', 'Taxonomy']  # Add new column headers

        rows = []
        for row in reader:
            accession = row[0]  # Assuming accession numbers are in the first column
            print(accession)
            organism, taxonomy = get_organism_taxonomy(accession)
            row += [organism, taxonomy]  # Append organism and taxonomy to the row
            rows.append(row)

    with open(output_file, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(header)
        writer.writerows(rows)
 nput_csv = '/lab/wengpj01/genomes/accessions_pt2.csv'
output_csv = '/lab/wengpj01/genomes/all_genomes_taxa_pt2.csv'
update_csv(input_csv, output_csv)

Now I'm returning to the same code, and it doesn't appear to be working anymore. I keep getting an error that says:

Traceback (most recent call last):   File "./get_info.py", line 9, in <module>
    handle=Entrez.efetch(db='assembly', id='GCA_000001405.29',rettype='xml')   File "/usr/local/lib/python3.8/dist-packages/Bio/Entrez/__init__.py", line 207, in efetch
    return _open(cgi, variables, post=post)   File "/usr/local/lib/python3.8/dist-packages/Bio/Entrez/__init__.py", line 606, in _open
    handle = urlopen(cgi)   File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)   File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)   File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(   File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)   File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)   File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp) urllib.error.HTTPError: HTTP Error 400: Bad Request

This error happens anytime I set rettype="xml" even though I swear this worked 10 days ago. The only thing that I know I've changed is that I got an api key. I've tried setting that in the code, but it doesn't seem to help.

What's weird is that when I try other rettypes, I get trivial responses. For instance:

handle=Entrez.efetch(db='assembly', id='GCA_000001405.29')
record=Entrez.read(handle)
print(record)

Returns only:

['1405', '29']

rather than a full xml about the human genome.

Any advice about how I could get this working would be greatly appreciated! Thanks so much!

efetch biopython entrez • 1.1k views
ADD COMMENT
2
Entering edit mode

There must be some temporary glitch.

Following command works (via EntrezDirect):

$ esearch -db assembly -query GCA_000001405.29 | esummary | xtract -pattern DocumentSummary -element Taxid,SpeciesName
9606    Homo sapiens
ADD REPLY
0
Entering edit mode

You should probably not advertise your personal email. It is not needed to understand the code.

I don't know what exactly is the problem with your attempts, but I do know that many servers temporarily fail to operate properly during the weekends. You know, maintenance and such things. If the script was working before, I'd try it on a working day before concluding that something is wrong.

ADD REPLY
2
Entering edit mode
10 months ago
MirianT_NCBI ▴ 720

Hi,
As an alternative, you could use NCBI Datasets for this task. NCBI Datasets allows users to retrieve metadata reports in JSON or JSON-Lines format, and those can be converted to tsv using dataformat, a datasets companion tool. If you install datasets using conda, both tools are included.

Assuming you have a list of accessions (one per line, here genomes.list), you can use the summary option in datasets, combined with `dataformat.

datasets summary genome accession --inputfile genomes.list --as-json-lines | \
dataformat tsv genome --fields accession,organism-common-name,organism-name,organism-tax-id


Assembly Accession      Organism Common Name    Organism Name   Organism Taxonomic ID
GCA_005887515.3         domestic yak    Bos grunniens   30521
GCA_000002285.4         dog     Canis lupus familiaris  9615
GCF_018350175.1         domestic cat    Felis catus     9685
GCF_000001405.40        human   Homo sapiens    9606
GCF_000001635.27        house mouse     Mus musculus    10090

For a list of dataformat genome fields, you can refer to this documentation page: dataformat genome tsv .

Please feel free to reach out if you have any other questions.

ADD COMMENT

Login before adding your answer.

Traffic: 2036 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6