Is there a specific option in BioPython that lets you call the gene name given an address of a nucleotide?
Entering edit mode
8.2 years ago
Tom ▴ 20

So I have a simple task, but I cannot find the specific method on BioPython documentation. I have a genbank of a bacteria. I have a list numbers, which are locations of specific SNPs on the genome:


I want to know what gene each of these addresses correspond to. What series of methods do I use?

As a clear example, the final output should look like this:

234 : Is within Dehydroxylate III
1135 : Is within Dehydroxylate III
51515 : Is within Collagen Repeat factor alpha
63475 : is within DNA Polymerase III
94818 : is within DNA Helicase subunit 1
BioPython • 2.1k views
Entering edit mode

You know the name of your bacteria, am I right?

See the following post, find my answer to that bacterial question.

C: where can I get environmental bacteria genome in fasta format (as many as possib

Current version of NCBI provides a file with urls for all bacteria in the database.

Find your bacterium name, press on it, copy this url and apply in some browser.

You will easily find all files available in NCBI for your bacterium.

faa-files have a list of all bacterial genes, and they provide coordinates of these genes

ffn-files have the same gene order but no coordinates, unfortunately.

What is nice - these files are collinear. Most bacteria have a single chromosome.

First, you find your gene in faa-file (they have their coordinates in their header).

So this is a simple search in protein fasta-file headers,

If .fna file is provided, you can map the corresponding pretranslated gene (.ffn)

having needed coordinate from your list to the chromosome *.fna-file.

I would try to do something like that. It doesn't look easy,

so I hope some biopython option does exist.

Entering edit mode

No , this is for custom genbanks.

I need an output file that will look like this

234 : Is within Dehydroxylate III
1135 : Is within Phosponate gamma
51515 : Is within Collagen Repeat factor alpha
63475 : is within DNA Polymerase III
Entering edit mode
8.1 years ago
Steven Lakin ★ 1.8k

You're looking for this section in the biopython tutorial and cookbook Location testing

You can use the Python keyword in with a SeqFeature or location object to see if the base/residue for a parent coordinate is within the feature/location or not.

For example, suppose you have a SNP of interest and you want to know which features this SNP is within, and lets suppose this SNP is at index 4350 (Python counting!). Here is a simple brute force solution where we just check all the features one by one in a loop:

>>> from Bio import SeqIO
>>> my_snp = 4350
>>> record ="", "genbank")
>>> for feature in record.features:
...     if my_snp in feature:
...         print("%s %s" % (feature.type, feature.qualifiers.get('db_xref')))
source ['taxon:229193']
gene ['GeneID:2767712']
CDS ['GI:45478716', 'GeneID:2767712']
Note that gene and CDS features from GenBank or EMBL files defined with joins are the union of the exons – they do not cover any introns.
Entering edit mode

That's exactly the approach I was going to suggest - although note that this isn't as efficient as it could be without indexing the feature locations, something Biopython doesn't currently offer. Note do double check your SNP co-ordinates are using the right offset for the Python style used here!


Login before adding your answer.

Traffic: 1253 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6