Question: Modify location of a GenBank feature
1
gravatar for kavin.pl
4.8 years ago by
kavin.pl70
United Kingdom
kavin.pl70 wrote:

I am trying to modify the location of features within a GenBank file. I know feature.type will give gene/CDS and feature.qualifiers will then give "db_xref"/"locus_tag"/"inference" etc. Is there a feature. object which will allow me to access the location (eg: [5240:7267](+) ) directly? 

This URL give a bit more info, though I can't figure out how to use it for my purpose... http://biopython.org/DIST/docs/api/Bio.SeqFeature.SeqFeature-class.html#location_operator

 

 Essentially, I want to modify the following bit of a GenBank file:

     gene            5240..7267
                     /db_xref="GeneID:887081"
                     /locus_tag="Rv0005"
                     /gene="gyrB"
     CDS             5240..7267
                     /locus_tag="Rv0005"
                     /inference="protein motif:PROSITE:PS00177"
                     ...........................

to

     gene            5357..7267
                     /db_xref="GeneID:887081"
                     /locus_tag="Rv0005"
                     /gene="gyrB"
     CDS             5357..7267
                     /locus_tag="Rv0005"
                     /inference="protein motif:PROSITE:PS00177"
                     .............................
             
Note the changes from 5240 to 5357

So far I have the following python script:

    from Bio import SeqIO
    gb_file = "mtbtomod.gb"
    gb_record = SeqIO.parse(open(gb_file, "r+"), "genbank")
    rvnumber = 'Rv0005'
    newstart = 5357

    final_features = []

    for record in gb_record:
      for feature in record.features:
        if feature.type == "gene":
            if feature.qualifiers["locus_tag"][0] == rvnumber:
                if feature.location.strand == 1:
                    # Amend feature location from current to 'newstart'
                else:
                    # do the reverse for the complementary strand
        final_features.append(feature)
      record.features = final_features
      with open("test.gb","w") as test:
        SeqIO.write(record, test, "genbank")

Rv0005 is just an example of a locus_tag I need to update. I have about 600 more locations to update per GenBank file, and about 10-20 GenBank files to process (with more to come)

R genbank python • 2.5k views
ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by kavin.pl70
1

Cross-post from http://stackoverflow.com/questions/24636588/modify-location-of-a-genbank-feature

I guess that you didn't get an answer there. But the link and explanations over there can be useful.

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by Hugues250

Yes indeed - although the RE.type way is not totally appropriate and I thought there may be more GenBank centric people over here...

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by kavin.pl70
2
gravatar for Hugues
4.8 years ago by
Hugues250
Oslo, Norway
Hugues250 wrote:

Where is the question? Do you get an error message?

Well I do get one: when I do something like:

features.location.start.position = newstart

I get an error:

seqfeature AttributeError can't set attribute

And it doesn't matter whether the genbank file was open read-only as you did, or in writing mode.

I suppose that you need to create a new genbank file, and copy all the attributes.

Edit 1:

Well modifying the start is possible after all.

from Bio import SeqIO
from Bio import SeqFeature
start_pos = SeqFeature.AfterPosition( newstart )
end_pos = SeqFeature.BeforePosition( feature.location.end.position )
my_location = SeqFeature.FeatureLocation(start_pos, end_pos)
features = my_location

But that does not solve the saving problem yet.

Edit 2:

Useful links:

Add A New Feature In Biopython

Dealing with GenBank files in Biopython

Parsing Genbank files with Biopython

 

 

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by Hugues250
1

Basically the "can't set attribute" error is the reason for my question. I don't know how to modify the locations while importing the features into a new GB file. Any ideas on how to do that?

All I can do is add a new qualifier through

feature.qualifiers["amend_position"] = "%s:%s" % (newstart, feature.location.end+1)

but that keeps the locations (ie, gene  5240..7267) the same while adding an extra qualifier (ie, /amend_position="5357:7267")

ADD REPLYlink written 4.8 years ago by kavin.pl70
1

"how to modify the locations" -> I've answered in my Edit 1. Please try it out and let me know.

"while importing the features into a new GB file" -> I haven't tried hard yet. Have a go at it yourself. Edit your question where you show us your attempts.

ADD REPLYlink written 4.8 years ago by Hugues250
1

As easy as this...

start_pos = SeqFeature.ExactPosition(newstart)
end_pos = SeqFeature.ExactPosition(feature.location.end.position)
my_location = SeqFeature.FeatureLocation(start_pos, end_pos, feature.location.strand)
feature.location = my_location

Thanks for pointing out the Seq.Feature implementation :)

Now, on to modifying the CDS as well, and making sure this works in the complementary strand.. All this before implementing a code to look through an excel sheet and pick up the strand and newstarts.

ADD REPLYlink written 4.8 years ago by kavin.pl70

Show me some love. Vote me up! I'm paid in rep. points :-)

ADD REPLYlink written 4.8 years ago by Hugues250
2
gravatar for kavin.pl
4.8 years ago by
kavin.pl70
United Kingdom
kavin.pl70 wrote:

Ok, I have something which now fully works. I'll post the code in case anyone ever needs something similar

__author__ = 'Kavin'

from Bio import SeqIO
from Bio import SeqFeature
import xlrd
import sys
import re

workbook = xlrd.open_workbook(sys.argv[2])
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

# Create dicts to store TSS data
TSS = {}
row = {}
# For each entry (row), store the startcodon and strand information
for i in range(2, sheet.nrows - 1):
    if data[i][5] < -0.7:   # Ensures BASS score is within significant range
        Gene = data[i][0]
        row['Direction'] = str(data[i][3])
        row['StartCodon'] = int(data[i][4])
        TSS[str(Gene)] = row
        row = {}
    else:
        i += 1

# Create an output filename based on input filename
outfile_init = re.search('(.*)\.(\w*)', sys.argv[1])
outfile = str(outfile_init.group(1)) + '_modified.' + str(outfile_init.group(2))

final_features = []
for record in SeqIO.parse(open(sys.argv[1], "r"), "genbank"):
    for feature in record.features:
        if feature.type == "gene" or feature.type == "CDS":
            if TSS.has_key(feature.qualifiers["locus_tag"][0]):
                newstart = TSS[feature.qualifiers["locus_tag"][0]]['StartCodon']
                if feature.location.strand == 1:
                    feature.location = SeqFeature.FeatureLocation(SeqFeature.ExactPosition(newstart - 1),
                                                                  SeqFeature.ExactPosition(
                                                                      feature.location.end.position),
                                                                  feature.location.strand)
                else:
                    feature.location = SeqFeature.FeatureLocation(
                        SeqFeature.ExactPosition(feature.location.start.position),
                        SeqFeature.ExactPosition(newstart), feature.location.strand)
        final_features.append(feature)  # Append final features
    record.features = final_features
    with open(outfile, "w") as new_gb:
        SeqIO.write(record, new_gb, "genbank")

This assumes usage such as python program.py <genbankfile> <excel spreadsheet>

This also assumes a spreadsheet of the following format:

Table S3. TSS mapping and re-annotation of start codons.  
Gene Synonym Tuberculist annotated start Orientation Re-annotated start* BASS score*
Rv0005 gyrB 5240 + 5357 -1.782
Rv0012 Rv0012 14089 + 14134 -1.553
Rv0018c pstP 23181 - 23172 -2.077
Rv0032 bioF2 34295 + 34307 -0.842
Rv0037c Rv0037c 41202 - 41163 -0.554
Rv0038 Rv0038 41304 + 41307 -0.581

 

 

Edit:

I embellished the code a touch

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by kavin.pl70

Thanks for posting your code / solution, I'm sure it can help others.

ADD REPLYlink written 4.8 years ago by Hugues250
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1818 users visited in the last hour