Question: Getting proteins from genomic interval of a genbank file
0
gravatar for fhsantanna
16 months ago by
fhsantanna490
Brazil
fhsantanna490 wrote:

I am trying to get all proteins from a genbank file considering a sequence interval. Here is the code of the script written in python:

#!/usr/bin/env python

#Script for getting proteins in a range from a genbank file
#Usage: python get_proteins.py file start:end

import sys
from Bio import SeqIO

rec = SeqIO.read(sys.argv[1], 'genbank')
feats = [feat for feat in rec.features if feat.type == "CDS"]
start, end = sys.argv[2].split(':')

desired = set(range(int(start),int(end),1))

for f in feats:
    span = set(range(f.location._start.position, f.location._end.position))
    if span & desired:
        print("%s,%s,%d:%d\n%s\n" % (
                    f.qualifiers['locus_tag'][0],
                    f.qualifiers['product'][0],
                    f.location._start.position,
                    f.location._end.position,
                    f.qualifiers["translation"][0]))

There are two problems: For some regions containing pseudogenes, printing "f.qualifiers["translation"][0]" breaks the script;

$ python get_proteins.py ../gbk/NZ_AOLM01000025.1.gbk 92686:113021 > island_proteins.faa
Traceback (most recent call last):
  File "get_proteins.py", line 24, in <module>
    f.qualifiers["translation"][0]))
KeyError: 'translation'

I know that I must use "try", but I was not successful in doing it.

The second problem is that some genbank files give the following error:

$ python get_proteins.py ../gbk/NZ_CM001555.1.gbk 530979:551299 > island_proteins.faa
Traceback (most recent call last): File "get_proteins.py", line 17, in> <module>
 span = set(range(f.location._start.position, f.location._end.position)) AttributeError: 'CompoundLocation' object
has no attribute '_start'

I did not understand this last problem.

Thank you.

biopython genbank proteins cds • 444 views
ADD COMMENTlink modified 16 months ago by Joe18k • written 16 months ago by fhsantanna490

I think I managed the problem:

#!/usr/bin/env python

#Script for getting proteins in a range from a genbank file
#Usage: python get_proteins.py file start:end


#!/usr/bin/env python

#This script is a modification of the script found in Peter Cock's site (http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/genbank2fasta/).
# Usage: python gbk2faa.py  start:end 

import sys
from Bio import GenBank
from Bio import SeqIO

input_handle  = open(sys.argv[1], "r")
output_handle = open(sys.argv[3], "w")

desired_interval = sys.argv[2].split(':')
start = int(desired_interval[0])
end = int(desired_interval[1])
#desired_interval = set(range(int(start),int(end),1))

for seq_record in SeqIO.parse(input_handle, "genbank") :
    print("Dealing with GenBank record %s" % seq_record.id)
    for seq_feature in seq_record[start:end].features :
        try: # Without "try", it crashes when it finds a CDS without translation (pseudogene).
            if seq_feature.type=="CDS" :
                assert len(seq_feature.qualifiers['translation'])==1
                output_handle.write(">%s,%s,%s,%s\n%s\n" % (
                    seq_feature.qualifiers['locus_tag'][0],
                    seq_feature.qualifiers['product'][0],
                    seq_record.id,
                    seq_record.description,
                    seq_feature.qualifiers['translation'][0]))
                pass
        except:
            continue

output_handle.close()
input_handle.close()
print("Done")
ADD REPLYlink written 16 months ago by fhsantanna490
3
gravatar for Joe
16 months ago by
Joe18k
United Kingdom
Joe18k wrote:

I've built something around that exact code before, for this problem: https://github.com/jrjhealey/bioinfo-tools/blob/master/get_spans.py

It requires that the file be a single contiguous genbank (for now). It prints the features, but it should be easy to alter the output to get fasta sequences or similar via Biopython.

ADD COMMENTlink written 16 months ago by Joe18k
1
gravatar for Pierre Lindenbaum
16 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

using xslt and then filter the TSV output with awk

:

ADD COMMENTlink modified 16 months ago • written 16 months ago by Pierre Lindenbaum130k

Thank you very much. But for 100,000 files this is not feasible.

ADD REPLYlink written 16 months ago by fhsantanna490

But for 100,000 files this is not feasible.

a loop ?

ADD REPLYlink written 16 months ago by Pierre Lindenbaum130k

Sorry, you are right. I did not pay attention. Maybe your solution is faster than mine. I'll test it. Thank you again.

ADD REPLYlink written 16 months ago by fhsantanna490

Could you provide your awk code, please?

ADD REPLYlink written 16 months ago by fhsantanna490
1

something like (not tested):

awk -v B=123  -v E=456 '{x1=int($2);x2=int($3);b=x1<x2?x1:x2;e=x1<x2?x2:x1; if (!(b>int(E) || e<int(B))) print; }' input.tsv
ADD REPLYlink modified 16 months ago • written 16 months ago by Pierre Lindenbaum130k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1387 users visited in the last hour