Question: Extract genbank features inside a given range
0
gravatar for tim.ivanov.92
8 months ago by
tim.ivanov.9210 wrote:

How to retrieve all features inside a given range from genbank record, using biopython?

f1 = FeatureLocation(10, 40, strand=+1)
recs = [rec for rec in SeqIO.parse('my_genbank.gb', "genbank")]
feats = [feat for feat in rec.features if feat.type == "CDS"]
for feat in feats:
    if feat.location in f1:
        print feat.location
        print '\n'

i've tried this, but i get this error

ValueError: Currently we only support checking for integer positions being within a FeatureLocation.

genbank biopython • 275 views
ADD COMMENTlink modified 8 months ago by jrj.healey12k • written 8 months ago by tim.ivanov.9210
0
gravatar for jrj.healey
8 months ago by
jrj.healey12k
United Kingdom
jrj.healey12k wrote:

I think you don't need your first line. From a quick read of the docs, FeatureLocation seems to be more used in creating features at specific locations than testing them. Consequently its expecting a simple integer, but you're giving it an object.

for example, the following works with my test data:

recs = [rec for rec in SeqIO.parse('my_genbank.gb', "genbank")]
feats = [feat for feat in rec.features if feat.type == "CDS"]
for feat in feats:
    if 456 in feat:    # Notice its only a simple integer
        print feat

# ----- Output -----

type: CDS
location: [7:457](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: inference, Value: ['ab initio prediction:Prodigal:2.60', 'protein motif:Pfam:PF06841.6']
    Key: locus_tag, Value: ['PAU_01961']
    Key: product, Value: ['T4-like virus tail tube protein gp19']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MSTTADQIAVQYPIPTYRFVVTIGDEQMCFQSVSGLDISYDTIEYRDGVGNWLQMPGQRQRPTITLKRGIFKGQSKLYDWINSISLNQIEKKDISISLTDETGSNLLITWNIANAFPEKLTAPSFDATSNEVAVQEISLKADRVTVEFH']

Incidentally, you can use normal python 'slicing' nomenclature to extract subregions of a genbank and write them out in any format you like with SeqIO.write. I exploit this for creating sub-setted genbanks in this script for instance:

https://github.com/jrjhealey/bioinfo-tools/blob/master/Genbank_slicer.py

Particularly line 121.

ADD COMMENTlink modified 8 months ago • written 8 months ago by jrj.healey12k

Thank you for your reply! Although, i still don't get it I need to check if the feature is inside a given range, not the other way around I try something like:

SeqIO.write(record, cwd+'/test_directory_for_trash/test_for_annotation.gb', "gb")
recs = [rec for rec in SeqIO.parse(cwd+'/test_directory_for_trash/test_for_annotation.gb', "genbank")]
feats = [feat for feat in rec.features if feat.type == "CDS"]
for feat in feats:
    if feat in [10000,12000]:
        print feat.location
        print '\n'

but i get this

ValueError: Locus identifier 'NW_016067404.1:1119446-1134925' is too long

ADD REPLYlink written 8 months ago by tim.ivanov.9210

I don't think your approach will really work in that case.

Try the following instead:

from __future__ import print_function
import sys
from Bio import SeqIO

rec = SeqIO.read(sys.argv[1], 'genbank')
feats = [feat for feat in rec.features if feat.type == "CDS"]
start, end = sys.argv[2].split(':')

desired = set(xrange(int(start),int(end),1))

for f in feats:
    span = set(xrange(f.location._start.position, f.location._end.position))
    if span & desired:
        print(f)

call the script as:

python scriptname.py genbank.gb 123:456

I adapted this slightly from here, so there are answers out there already if you need more info.

If you want to filter by strand still, you can do that in addition by testing f further.

ADD REPLYlink modified 8 months ago • written 8 months ago by jrj.healey12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1876 users visited in the last hour