Question

How to access specific gene data from the NCBI Gene database with Biopython

2.9 years ago

João Afonso • 0

Hi all,

I'm trying to get all the gene synonyms for a given gene from NCBI with Biopython.

from Bio import Entrez
Entrez.email = "A.N.Other@example.com"
handle = Entrez.efetch(db="gene", id="3675", rettype="", retmode="")
results = handle.read()

This code returns all the data for the given gene in ASN.1 format (check here for the possible return formats).

I have now looked through the whole Biopython documentation, and there seems to be no easy way to access components of the returned ASN.1 string: no parser, nothing. I even tried a couple of Python ASN.1 packages, but they seem to decode only binary ASN.1 files.

Ideally I'd like a dictionary or similar structure so I can access elements by key. What's the best way to approach this?

Thanks a lot!

Biopython • 633 views

updated 2.9 years ago by Istvan Albert 100k • written 2.9 years ago by João Afonso • 0

score 1 · Answer 1 · 2021-06-02

You could fetch the XML format and turn that into a dictionary like so:

# pip install xmltodict
from pprint import pprint

import xmltodict
from Bio import Entrez

Entrez.email = "A.N.Other@example.com"

# Request the record as XML instead of ASN.1
handle = Entrez.efetch(db="gene", id="3675", rettype="", retmode="xml")
results = handle.read()
handle.close()

# Convert the XML document into nested dictionaries
data = xmltodict.parse(results)

pprint(data)

This prints a gigantic nested dictionary.
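Since the parsed result is deeply nested, a small recursive lookup can pull out just the fields of interest. The sketch below is only an illustration: `find_values` is a hypothetical helper, the `sample` dictionary is a hand-made stand-in for the real record, and the synonym values are placeholders. It assumes the symbol sits under `Gene-ref_locus` and the synonyms under `Gene-ref_syn_E`, so check those key names against the actual output before relying on them.

```python
def find_values(node, key):
    """Recursively collect every value stored under `key` in nested dicts/lists."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                # xmltodict gives a list for repeated elements, a scalar otherwise
                found.extend(v if isinstance(v, list) else [v])
            else:
                found.extend(find_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_values(item, key))
    return found


# Hand-made stand-in for the parsed record; synonym values are placeholders
sample = {
    "Entrezgene-Set": {
        "Entrezgene": {
            "Entrezgene_gene": {
                "Gene-ref": {
                    "Gene-ref_locus": "ITGA3",
                    "Gene-ref_syn": {"Gene-ref_syn_E": ["SYN_A", "SYN_B"]},
                }
            }
        }
    }
}

symbols = find_values(sample, "Gene-ref_locus")
synonyms = find_values(sample, "Gene-ref_syn_E")
print(symbols, synonyms)
```

With the real record you would call `find_values(data, "Gene-ref_syn_E")` on the dictionary returned by `xmltodict.parse`.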

You might be much better off getting the information with Entrez Direct:

efetch -db gene -id 3675 -format xml > out.xml
cat out.xml | xtract -pattern Gene-ref -element Gene-ref_locus

prints:

ITGA3
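If you would rather stay in Python, roughly the same extraction can be sketched with the standard library's `xml.etree.ElementTree`. This is a rough equivalent and not a reimplementation of `xtract`: `gene_symbols` is a hypothetical helper, and `sample_xml` is a trimmed, hand-written stand-in for the real record that assumes `Gene-ref_locus` is a direct child of `Gene-ref`.

```python
import xml.etree.ElementTree as ET


def gene_symbols(xml_text):
    """Return the Gene-ref_locus text of every Gene-ref element in the document."""
    root = ET.fromstring(xml_text)
    symbols = []
    for gene_ref in root.iter("Gene-ref"):
        locus = gene_ref.findtext("Gene-ref_locus")
        if locus:
            symbols.append(locus)
    return symbols


# Trimmed, hand-written stand-in for the efetch XML output
sample_xml = """
<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_gene>
      <Gene-ref>
        <Gene-ref_locus>ITGA3</Gene-ref_locus>
      </Gene-ref>
    </Entrezgene_gene>
  </Entrezgene>
</Entrezgene-Set>
"""

print(gene_symbols(sample_xml))
```

For the file saved above, you could read `out.xml` into a string and pass it to `gene_symbols`.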