I have a list of NCBI assembly accession numbers, and I'm trying to return the species name (Latin and common) plus the taxid for each one. A few weeks ago, I wrote a bit of code that appeared to do the trick.
#!/usr/bin/env python
import csv
from Bio import Entrez

Entrez.email = 'xxx'

def get_organism_taxonomy(accession):
    handle = Entrez.efetch(db='assembly', id=accession, rettype='xml')
    record = Entrez.read(handle)
    organism = record['DocumentSummarySet']['DocumentSummary'][0]['Organism']
    taxonomy = record['DocumentSummarySet']['DocumentSummary'][0]['Taxid']
    return organism, taxonomy

def update_csv(input_file, output_file):
    with open(input_file, 'r') as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)  # Read the header row
        header += ['Organism', 'Taxonomy']  # Add new column headers
        rows = []
        for row in reader:
            accession = row[0]  # Assuming accession numbers are in the first column
            print(accession)
            organism, taxonomy = get_organism_taxonomy(accession)
            row += [organism, taxonomy]  # Append organism and taxonomy to the row
            rows.append(row)
    with open(output_file, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(header)
        writer.writerows(rows)

input_csv = '/lab/wengpj01/genomes/accessions_pt2.csv'
output_csv = '/lab/wengpj01/genomes/all_genomes_taxa_pt2.csv'
update_csv(input_csv, output_csv)
Now I'm returning to the same code, and it doesn't appear to be working anymore. I keep getting an error that says:
Traceback (most recent call last):
  File "./get_info.py", line 9, in <module>
    handle=Entrez.efetch(db='assembly', id='GCA_000001405.29',rettype='xml')
  File "/usr/local/lib/python3.8/dist-packages/Bio/Entrez/__init__.py", line 207, in efetch
    return _open(cgi, variables, post=post)
  File "/usr/local/lib/python3.8/dist-packages/Bio/Entrez/__init__.py", line 606, in _open
    handle = urlopen(cgi)
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
This error happens any time I set rettype="xml", even though I swear this worked 10 days ago. The only thing that I know I've changed is that I got an API key. I've tried setting the key in the code, but it doesn't seem to help.
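One way to get more detail than the bare "400: Bad Request" line is to read the body of the failed response, which may include a message from NCBI about which parameter it rejected. The following is only a minimal sketch: `describe_http_error` and `fetch_or_explain` are hypothetical helper names, and the commented usage assumes the key is set through Biopython's `Entrez.api_key` attribute.

```python
import io
from urllib.error import HTTPError


def describe_http_error(err):
    """Return the status code, reason, and any body text the server sent back."""
    body = err.read().decode("utf-8", errors="replace").strip()
    return f"HTTP {err.code} {err.reason}: {body}"


def fetch_or_explain(fetch, **params):
    """Call an Entrez-style function; return (handle, None) or (None, message)."""
    try:
        return fetch(**params), None
    except HTTPError as err:
        return None, describe_http_error(err)


# Hypothetical usage (needs Biopython and network access, so left commented out):
# from Bio import Entrez
# Entrez.email = 'xxx'
# Entrez.api_key = 'xxx'  # placeholder, not a real key
# handle, problem = fetch_or_explain(
#     Entrez.efetch, db='assembly', id='GCA_000001405.29', rettype='xml')
# if problem:
#     print(problem)
```

Passing the Entrez function in as an argument keeps the helper independent of Biopython, so the same wrapper can be tried with other calls or other rettype values.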
What's weird is that when I try other rettypes, I get trivial responses. For instance:
handle=Entrez.efetch(db='assembly', id='GCA_000001405.29')
record=Entrez.read(handle)
print(record)
rather than the full XML record for the human genome.
Any advice about how I could get this working would be greatly appreciated! Thanks so much!