Question: Extracting blocks features by their coordinates in a genome (using BioPython)?
gravatar for little_more
7 months ago by
little_more20 wrote:

Let's say I have a GenBank ID for a genome (BA000007.3) and a list with genome coordinates for blocks:

blocks = [(1443423, 1444553), ...]

I need to find out what genes (or parts of what genes) are covered by these blocks.

I've tried this:

from Bio import Entrez, SeqIO = ""
with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id="BA000007.3") as handle: 
    seq_record =, "gb")

sub_record = seq_record[1443423:1444553]

It returns 0, as no genes are fully covered by this block. However, if you open the NCBI record in a browser and change region shown you will see that there are 2 genes that are partially covered by this block.

I've also tried substracting with SeqFeature and FeatureLocation but it didn't work either (a new sequence doesn't have features at all).

So is there any way to do this without manually checking each block for each genome?

biopython • 267 views
ADD COMMENTlink modified 7 months ago by vkkodali1.7k • written 7 months ago by little_more20

Your approach should work, but you may have to come up with some new logic if you’re interested in partial coverage too. This is exactly the approach I’ve used myself for extracting genes within ranges:

Off the top of my head, you could check something like (pseudocode):

if feature.loc.start() or feature.loc.end() in range(block):
ADD REPLYlink modified 7 months ago • written 7 months ago by Joe16k

Thank you for the answer! I've thought of this but as I have many blocks for many genomes with many features I was wondering if there is more elegant way. I guess such a task may be quite common.

ADD REPLYlink written 7 months ago by little_more20

I’m not sure what you mean? What is the exact bit that’s stumping you?

ADD REPLYlink written 7 months ago by Joe16k

I'm worried that due to a large number of features and blocks this approach will take a lot of time. That's why I was asking for a more faster way (maybe there is a non-obvious way to do this using BioPython).

ADD REPLYlink written 7 months ago by little_more20

The sequence subsetting should be pretty rapid. You might encounter memory issues if you try to hold all the results in memory at once.

Do you need the entrez bit in your code or can you do it all locally? The entrez queries will likely be much slower than the block querying.

If that’s still too slow, (bio)python is not the language for you.

ADD REPLYlink written 7 months ago by Joe16k
gravatar for vkkodali
7 months ago by
United States
vkkodali1.7k wrote:

You can use the Entrez Direct tool efetch as follows:

efetch -db nuccore -id BA000007.3 -format gb -seq_start 1443423 -seq_stop 1444553

This will return a genbank flat file with the two partial gene features. This translates to the following in biopython:

with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id="BA000007.3", seq_start="1443423", seq_stop="1444553") as handle: 
    seq_record =, "gb")

Now, the seq_record object has the two gene features:

>>> print(seq_record.features)
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1131), strand=1), type='source'), 
SeqFeature(FeatureLocation(BeforePosition(0), AfterPosition(1131), strand=1), type='misc_feature'), 
SeqFeature(FeatureLocation(BeforePosition(0), ExactPosition(497), strand=1), type='gene'), 
SeqFeature(FeatureLocation(BeforePosition(0), ExactPosition(497), strand=1), type='CDS'), 
SeqFeature(FeatureLocation(ExactPosition(868), AfterPosition(1131), strand=1), type='gene'), 
SeqFeature(FeatureLocation(ExactPosition(868), AfterPosition(1131), strand=1), type='CDS')]
ADD COMMENTlink modified 7 months ago • written 7 months ago by vkkodali1.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2259 users visited in the last hour