Question

Extract genbank features inside a given range

0

Entering edit mode

5.6 years ago

tim.ivanov.92 ▴ 40

How to retrieve all features inside a given range from genbank record, using biopython?

f1 = FeatureLocation(10, 40, strand=+1)
recs = [rec for rec in SeqIO.parse('my_genbank.gb', "genbank")]
feats = [feat for feat in rec.features if feat.type == "CDS"]
for feat in feats:
    if feat.location in f1:
        print feat.location
        print '\n'

i've tried this, but i get this error

ValueError: Currently we only support checking for integer positions being within a FeatureLocation.

biopython genbank • 3.4k views

ADD COMMENT • link updated 5.6 years ago by Joe 21k • written 5.6 years ago by tim.ivanov.92 ▴ 40

score 0 · Answer 1 · 2018-09-18

0

Entering edit mode

5.6 years ago

Joe 21k

I think you don't need your first line. From a quick read of the docs, FeatureLocation seems to be more used in creating features at specific locations than testing them. Consequently its expecting a simple integer, but you're giving it an object.

for example, the following works with my test data:

recs = [rec for rec in SeqIO.parse('my_genbank.gb', "genbank")]
feats = [feat for feat in rec.features if feat.type == "CDS"]
for feat in feats:
    if 456 in feat:    # Notice its only a simple integer
        print feat

# ----- Output -----

type: CDS
location: [7:457](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: inference, Value: ['ab initio prediction:Prodigal:2.60', 'protein motif:Pfam:PF06841.6']
    Key: locus_tag, Value: ['PAU_01961']
    Key: product, Value: ['T4-like virus tail tube protein gp19']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MSTTADQIAVQYPIPTYRFVVTIGDEQMCFQSVSGLDISYDTIEYRDGVGNWLQMPGQRQRPTITLKRGIFKGQSKLYDWINSISLNQIEKKDISISLTDETGSNLLITWNIANAFPEKLTAPSFDATSNEVAVQEISLKADRVTVEFH']

Incidentally, you can use normal python 'slicing' nomenclature to extract subregions of a genbank and write them out in any format you like with SeqIO.write. I exploit this for creating sub-setted genbanks in this script for instance:

https://github.com/jrjhealey/bioinfo-tools/blob/master/Genbank_slicer.py

Particularly line 121.

ADD COMMENT • link 5.6 years ago by Joe 21k

0

Entering edit mode

Thank you for your reply! Although, i still don't get it I need to check if the feature is inside a given range, not the other way around I try something like:

SeqIO.write(record, cwd+'/test_directory_for_trash/test_for_annotation.gb', "gb")
recs = [rec for rec in SeqIO.parse(cwd+'/test_directory_for_trash/test_for_annotation.gb', "genbank")]
feats = [feat for feat in rec.features if feat.type == "CDS"]
for feat in feats:
    if feat in [10000,12000]:
        print feat.location
        print '\n'

but i get this

ValueError: Locus identifier 'NW_016067404.1:1119446-1134925' is too long

ADD REPLY • link 5.6 years ago by tim.ivanov.92 ▴ 40

0

Entering edit mode

I don't think your approach will really work in that case.

Try the following instead:

from __future__ import print_function
import sys
from Bio import SeqIO

rec = SeqIO.read(sys.argv[1], 'genbank')
feats = [feat for feat in rec.features if feat.type == "CDS"]
start, end = sys.argv[2].split(':')

desired = set(xrange(int(start),int(end),1))

for f in feats:
    span = set(xrange(f.location._start.position, f.location._end.position))
    if span & desired:
        print(f)

call the script as:

python scriptname.py genbank.gb 123:456

I adapted this slightly from here, so there are answers out there already if you need more info.

If you want to filter by strand still, you can do that in addition by testing f further.

ADD REPLY • link 5.6 years ago by Joe 21k