Printing Non Coding Sequence Preceeding A Cds Of Embl File
1
0
Entering edit mode
10.7 years ago
Pappu ★ 2.1k

I have a huge embl file containing many sequences. I want to print non coding sequence before a CDS. For example I want to print 4670, 5026 given the id:E3HXX6 in an example file: http://www.ebi.ac.uk/ena/data/view/CP002289&display=text

FT   gene            **4670..5026**
FT                   /locus_tag="AXYL_06645"
FT   CDS             4670..5026
FT                   /codon_start=1
FT                   /transl_table=11
FT                   /locus_tag="AXYL_06645"
FT                   /product="hypothetical protein"
FT                   /db_xref="EnsemblGenomes:ADP19930"
FT                   /db_xref="EnsemblGenomes:AXYL_06645"
FT                   /db_xref="UniProtKB/TrEMBL:**E3HXX6**"
FT                   /protein_id="ADP19930.1"
FT                   /translation="MCYSAQLVADFKRVTRQYGVKIAFEDFVDLYLYHSTDSRPKTPRA
FT                   LDVAFTAADGPAGAAIQEAVRRWDQLEMQELEDLVFAQRVRLVSAEQALQKKVTKKAEN
FT                   DQCRRPSTSETCSA"

So I am trying to use Biopython but it is very slow when I have a long list of ids and big EMBL file:

from Bio.SeqIO import parse
id,cds=['E3HXX6'],[]
count,index=1,0
for rec in parse(open('test.embl','r'),'embl'):
    for i in rec.features:
        index+=1
        for j in id:        
          if j in str(i):
            print ' '.join(str(rec.features[index-3].location).split('](')[0].replace('[','').replace('>','').split(':'))

Let me know if you have any suggestions for faster code.

biopython • 2.5k views
ADD COMMENT
0
Entering edit mode

Are you trying to get the gene coordinates?

What is it you are trying to do with the I & j variables? If you are looking for an db_xref, search that directly - right now you are needlessly turning the feature into a string repeatedly (which is partly what is making things slow).

The last line seems needlessly complex, does this not work?

feature = rec.features[index-3]
print "%i %i" % (feature.location.start, feature.location.end)
ADD REPLY
0
Entering edit mode

I suppose some kind of indexing will make it faster.

ADD REPLY
0
Entering edit mode

No, you don't need indexing for this. You need to avoid all the string manipulation like 'if f in str(i)' and the print statement and use the provided object attributes directly.

If you want to try indexing the features, read this: www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features

ADD REPLY
0
Entering edit mode

Why do you take rec.features[index-3] when your description and the marking in the example suggests you want the previous feature?

ADD REPLY
1
Entering edit mode
10.7 years ago
Peter 6.0k

You've still not explained what you are trying to do, but might this be closer?

>>> from Bio import SeqIO
>>> wanted_ids = set(['UniProtKB/TrEMBL:E3HXX6'])
>>> for rec in SeqIO.parse('test.embl', 'embl'):
...       index = 0
...       for cur_feature in rec.features:
...           index += 1
...           if wanted_ids.intersection(cur_feature.qualifiers.get("db_xref", [])):
...               old_feature = rec.features[index - 1]
...               print "%i %i" % (old_feature.location.start + 1, old_feature.location.end)
... 
4670 5026

Note I'm giving the location of the previous feature with rec.features[index - 1] rather than what your example did which was three features back.

Also note the start + 1 in the print which is to convert from Python counting to normal one based counting as used in EMBL files.

You might also wish to add a filter if cur_feature.type == "CDS" to this which would probably make it faster.

ADD COMMENT

Login before adding your answer.

Traffic: 1758 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6