How to remove features from GBK files with biopython ?
0
0
Entering edit mode
20 months ago
hugo.avila ▴ 190

I need to remove features from a gbk file that have split coordinates:

join{[5355286:5355459](+), [0:17](+)}
join{[5355286:5355459](+), [0:17](+)}


So far i was able to test features to check if they are CompaundLocation types, but i culd not remove all off then from the record object:

#!/usr/bin/env python

from Bio import SeqIO
import sys

records = SeqIO.parse(sys.argv[1],"genbank")

for record in records:
for index, feature in enumerate(record.features):
try:
if feature.location.operator == "join":
record.features.pop(index)
except:
pass
SeqIO.write(record, "out.gbk", "genbank")

biopython • 618 views
0
Entering edit mode

Can you post on the genbank accession number for the file you are playing with so we can test some code?

1
Entering edit mode

Not at a computer to test right now, but it should be possible to do something like:

records.features.remove(feature)


If the features element is a list (I forget now). However, what you'll need to do is find out exactly what the feature object is, or its index within the genome. Your current approach should be what you need, but you want remove() instead of pop.

Edit: see comment below, del needed instead of pop

0
Entering edit mode

Thank you for the help. Replace pop() for remove() works, but only for genes, CDS still present on the output file.

Before:

FEATURES             Location/Qualifiers
source          1..5355459
/organism="Klebsiella pneumoniae subsp. pneumoniae"
/mol_type="genomic DNA"
/strain="TGH10"
/isolation_source="wound swab"
/host="Homo sapiens"
/sub_species="pneumoniae"
/db_xref="taxon:72407"
/country="Greece"
/lat_lon="37.97 N 23.73 E"
/collection_date="02-Mar-2013"
/collected_by="Olympia"
**gene            join(5355287..5355459,1..17)**
/locus_tag="AOG31_RS26915"
/pseudo=""
**CDS             join(5355287..5355459,1..17)**
/locus_tag="AOG31_RS26915"
/inference="COORDINATES: similar to AA
sequence:RefSeq:WP_006808765.1"
/note="frameshifted; internal stop; Derived by automated
computational analysis using gene prediction method:
Protein Homology."
/pseudo=""
/codon_start=1
/transl_table=11
/product="hypothetical protein"


After:

FEATURES             Location/Qualifiers
source          1..5355459
/organism="Klebsiella pneumoniae subsp. pneumoniae"
/mol_type="genomic DNA"
/strain="TGH10"
/isolation_source="wound swab"
/host="Homo sapiens"
/sub_species="pneumoniae"
/db_xref="taxon:72407"
/country="Greece"
/lat_lon="37.97 N 23.73 E"
/collection_date="02-Mar-2013"
/collected_by="Olympia"
**CDS             join(5355287..5355459,1..17)**
/locus_tag="AOG31_RS26915"
/inference="COORDINATES: similar to AA
sequence:RefSeq:WP_006808765.1"
/note="frameshifted; internal stop; Derived by automated
computational analysis using gene prediction method:
Protein Homology."
/pseudo=""
/codon_start=1
/transl_table=11
/product="hypothetical protein"

0
Entering edit mode

Hi, sorry i forgot to mention, here is the acc of the file NZ_CP012744.1.

1
Entering edit mode

I think the approach you have above will work, but you need del instead of pop.

Alternatively you can use filtering list comprehensions or filter functions to create a new list of features, e.g:

feat = [feature for feature in [feature for feature in feat  if feature.type == 'gene']]


will remove anything that has the feature type 'gene'. You can replace this with logic to remove joins instead, then re-write a new genbank file with the modified list of features.

0
Entering edit mode

This works:

#!/usr/bin/env python

from Bio import SeqIO
import sys

def check_feature(feature):
try:
if feature.location.operator == "join":
return True
except:
return False

#save only the first record
record = [i for i in SeqIO.parse(sys.argv[1],"genbank")][0]

features = [feature for feature in [feature for feature in record.features if not check_feature(feature)]]
SeqIO.write(features, "out.gbk", "genbank")


But i can't write extracted features to a new file. Error message:

Traceback (most recent call last):
File "./filter.py", line 17, in <module>
SeqIO.write(features, "out.gbk", "genbank")
File "/home/hugo/miniconda3/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 539, in write
count = writer_class(fp).write_file(sequences)
File "/home/hugo/miniconda3/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 253, in write_file
count = self.write_records(records)
File "/home/hugo/miniconda3/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 238, in write_records
self.write_record(record)
File "/home/hugo/miniconda3/lib/python3.7/site-packages/Bio/SeqIO/InsdcIO.py", line 862, in write_record
self._write_the_first_line(record)
File "/home/hugo/miniconda3/lib/python3.7/site-packages/Bio/SeqIO/InsdcIO.py", line 596, in _write_the_first_line
locus = record.name
AttributeError: 'SeqFeature' object has no attribute 'name'

0
Entering edit mode

You dont want to just write the features, you need to add those features to a new genbank (or whatever).

You're trying to write the record, but passing it a list of the records features instead.

0
Entering edit mode

Ok i will do it Thanks for all the help !