Question

Printing Non Coding Sequence Preceeding A Cds Of Embl File

0

Entering edit mode

10.7 years ago

Pappu ★ 2.1k

I have a huge embl file containing many sequences. I want to print non coding sequence before a CDS. For example I want to print 4670, 5026 given the id:E3HXX6 in an example file: http://www.ebi.ac.uk/ena/data/view/CP002289&display=text

FT   gene            **4670..5026**
FT                   /locus_tag="AXYL_06645"
FT   CDS             4670..5026
FT                   /codon_start=1
FT                   /transl_table=11
FT                   /locus_tag="AXYL_06645"
FT                   /product="hypothetical protein"
FT                   /db_xref="EnsemblGenomes:ADP19930"
FT                   /db_xref="EnsemblGenomes:AXYL_06645"
FT                   /db_xref="UniProtKB/TrEMBL:**E3HXX6**"
FT                   /protein_id="ADP19930.1"
FT                   /translation="MCYSAQLVADFKRVTRQYGVKIAFEDFVDLYLYHSTDSRPKTPRA
FT                   LDVAFTAADGPAGAAIQEAVRRWDQLEMQELEDLVFAQRVRLVSAEQALQKKVTKKAEN
FT                   DQCRRPSTSETCSA"

So I am trying to use Biopython but it is very slow when I have a long list of ids and big EMBL file:

from Bio.SeqIO import parse
id,cds=['E3HXX6'],[]
count,index=1,0
for rec in parse(open('test.embl','r'),'embl'):
    for i in rec.features:
        index+=1
        for j in id:        
          if j in str(i):
            print ' '.join(str(rec.features[index-3].location).split('](')[0].replace('[','').replace('>','').split(':'))

Let me know if you have any suggestions for faster code.

biopython • 2.5k views

ADD COMMENT • link updated 10.6 years ago by Biostar 20 • written 10.7 years ago by Pappu ★ 2.1k

0

Entering edit mode

Are you trying to get the gene coordinates?

What is it you are trying to do with the I & j variables? If you are looking for an db_xref, search that directly - right now you are needlessly turning the feature into a string repeatedly (which is partly what is making things slow).

The last line seems needlessly complex, does this not work?

feature = rec.features[index-3]
print "%i %i" % (feature.location.start, feature.location.end)

ADD REPLY • link 10.7 years ago by Peter 6.0k

0

Entering edit mode

I suppose some kind of indexing will make it faster.

ADD REPLY • link 10.7 years ago by Pappu ★ 2.1k

0

Entering edit mode

No, you don't need indexing for this. You need to avoid all the string manipulation like 'if f in str(i)' and the print statement and use the provided object attributes directly.

If you want to try indexing the features, read this: www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features

ADD REPLY • link 10.7 years ago by Peter 6.0k

0

Entering edit mode

Why do you take rec.features[index-3] when your description and the marking in the example suggests you want the previous feature?

ADD REPLY • link 10.7 years ago by Peter 6.0k

score 1 · Answer 1 · 2013-08-01

You've still not explained what you are trying to do, but might this be closer?

>>> from Bio import SeqIO
>>> wanted_ids = set(['UniProtKB/TrEMBL:E3HXX6'])
>>> for rec in SeqIO.parse('test.embl', 'embl'):
...       index = 0
...       for cur_feature in rec.features:
...           index += 1
...           if wanted_ids.intersection(cur_feature.qualifiers.get("db_xref", [])):
...               old_feature = rec.features[index - 1]
...               print "%i %i" % (old_feature.location.start + 1, old_feature.location.end)
... 
4670 5026

Note I'm giving the location of the previous feature with rec.features[index - 1] rather than what your example did which was three features back.

Also note the start + 1 in the print which is to convert from Python counting to normal one based counting as used in EMBL files.

You might also wish to add a filter if cur_feature.type == "CDS" to this which would probably make it faster.