How to use Biopython for fetching metadata of NCBI/GenBank/RefSeq assembly identifiers?
6.1 years ago
O.rka ▴ 740

I'm trying to use Python and Biopython to fetch metadata for a given assembly identifier. In this case, I'm looking for GCF_000005845.2.

from Bio import Entrez

# GCF_000005845.2
id_ecoli = "GCF_000005845.2"
esummary_handle = Entrez.esummary(db="assembly", id=id_ecoli, report="full")
record = Entrez.read(esummary_handle, validate=False)
record
# DictElement({'DocumentSummarySet': DictElement({'DocumentSummary': []}, attributes={'status': 'OK'})}, attributes={})

This is the type of data I'm looking for: https://www.ncbi.nlm.nih.gov/assembly/GCF_000005845.2/

I could write an HTML scraper, but I don't want to reinvent the wheel if something is already available.

biopython ncbi entrez Assembly fetch • 6.6k views

handle = Entrez.efetch(db="assembly", id="GCF_000005845.2")
record = Entrez.read(handle)
record

['5845', '2']

6.1 years ago

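If JSON/XML output is useful to you, the following script can be used. Note that esummary expects the numeric Entrez UID rather than the accession, so the script first resolves the accession to a UID with esearch.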

#!/usr/bin/python

from Bio import Entrez
import json

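#Fill in your own address; it lets NCBI contact you before blocking access in case of excessive use
Entrez.email = ""
#An API key raises the request limit from 3/s to 10/s
#Get one from https://www.ncbi.nlm.nih.gov/account/settings/ page
Entrez.api_key=""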

term="GCF_000005845.2"
#Finds the ids associated with the assembly
def get_ids(term):
    handle = Entrez.esearch(db="assembly", term=term)
    record = Entrez.read(handle)
    handle.close()
    #IdList is already a list of UID strings, so return it flat instead of nesting it
    return list(record["IdList"])

#Fetch raw output
def get_raw_assembly_summary(id):
    handle = Entrez.esummary(db="assembly",id=id,report="full")
    record = Entrez.read(handle)
    #Return individual fields
    #XML output: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=assembly&id=79781&report=%22full%22
    #return(record['DocumentSummarySet']['DocumentSummary'][0]['AssemblyName']) #This will return the Assembly name
    return(record)

#JSON formatted output
def get_assembly_summary_json(id):
    handle = Entrez.esummary(db="assembly",id=id,report="full")
    record = Entrez.read(handle)
    #Convert raw output to json
    return(json.dumps(record, sort_keys=True,indent=4, separators=(',', ': ')))

#Test
for id in get_ids(term):
    #print(get_raw_assembly_summary(id)) #For raw output
    print(get_assembly_summary_json(id)) #JSON Formatted
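
If you only need a few fields rather than the whole summary, something like the following may help. It is a minimal sketch, not part of the original script: the helper name `summarize_assembly` and the `FIELDS` tuple are made up for illustration. The record layout (`DocumentSummarySet` / `DocumentSummary`) and the `AssemblyName` key come from the script above; the other keys are assumptions about what the assembly summary contains, so any key that is missing is simply returned as `None`.

```python
# Hypothetical helper: pick a handful of fields out of a parsed esummary record.
# Only "AssemblyName" is confirmed by the script above; the other keys are
# assumed to be present in the assembly summary and fall back to None if not.
FIELDS = ("AssemblyAccession", "AssemblyName", "Organism", "Taxid", "FtpPath_RefSeq")


def summarize_assembly(record, accession):
    """Return the selected fields for the summary matching `accession`, or None."""
    docs = record["DocumentSummarySet"]["DocumentSummary"]
    for doc in docs:
        # A search can return more than one summary, so match the accession exactly.
        if doc.get("AssemblyAccession") == accession:
            return {key: doc.get(key) for key in FIELDS}
    return None


# Possible usage with the functions defined in the script above (needs network access):
# for uid in get_ids(term):
#     print(summarize_assembly(get_raw_assembly_summary(uid), term))
```

Returning `None` when nothing matches keeps the "not found" case explicit, which is the situation the empty `DocumentSummary` list in the question ran into.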

Wow, this answer is incredible! Thank you so much. This explains why my previous version wasn't working. Did you write this or did you find it in the docs somewhere?


I wrote these functions myself. More functions are available at the following link:

https://github.com/arupgsh/text_mining/blob/master/biopython_fun.py
