Question: Printing Non Coding Sequence Preceeding A Cds Of Embl File
gravatar for Pappu
7.6 years ago by
Pappu1.9k wrote:

I have a huge embl file containing many sequences. I want to print non coding sequence before a CDS. For example I want to print 4670, 5026 given the id:E3HXX6 in an example file:

FT   gene            **4670..5026**
FT                   /locus_tag="AXYL_06645"
FT   CDS             4670..5026
FT                   /codon_start=1
FT                   /transl_table=11
FT                   /locus_tag="AXYL_06645"
FT                   /product="hypothetical protein"
FT                   /db_xref="EnsemblGenomes:ADP19930"
FT                   /db_xref="EnsemblGenomes:AXYL_06645"
FT                   /db_xref="UniProtKB/TrEMBL:**E3HXX6**"
FT                   /protein_id="ADP19930.1"
FT                   DQCRRPSTSETCSA"

So I am trying to use Biopython but it is very slow when I have a long list of ids and big EMBL file:

from Bio.SeqIO import parse
for rec in parse(open('test.embl','r'),'embl'):
    for i in rec.features:
        for j in id:        
          if j in str(i):
            print ' '.join(str(rec.features[index-3].location).split('](')[0].replace('[','').replace('>','').split(':'))

Let me know if you have any suggestions for faster code.

biopython • 2.0k views
ADD COMMENTlink modified 7.4 years ago by Biostar ♦♦ 20 • written 7.6 years ago by Pappu1.9k

Are you trying to get the gene coordinates?

What is it you are trying to do with the I & j variables? If you are looking for an db_xref, search that directly - right now you are needlessly turning the feature into a string repeatedly (which is partly what is making things slow).

The last line seems needlessly complex, does this not work?

feature = rec.features[index-3]
print "%i %i" % (feature.location.start, feature.location.end)
ADD REPLYlink modified 7.6 years ago • written 7.6 years ago by Peter5.9k

I suppose some kind of indexing will make it faster.

ADD REPLYlink written 7.6 years ago by Pappu1.9k

No, you don't need indexing for this. You need to avoid all the string manipulation like 'if f in str(i)' and the print statement and use the provided object attributes directly.

If you want to try indexing the features, read this:

ADD REPLYlink modified 7.6 years ago • written 7.6 years ago by Peter5.9k

Why do you take rec.features[index-3] when your description and the marking in the example suggests you want the previous feature?

ADD REPLYlink modified 7.6 years ago • written 7.6 years ago by Peter5.9k
gravatar for Peter
7.6 years ago by
Scotland, UK
Peter5.9k wrote:

You've still not explained what you are trying to do, but might this be closer?

>>> from Bio import SeqIO
>>> wanted_ids = set(['UniProtKB/TrEMBL:E3HXX6'])
>>> for rec in SeqIO.parse('test.embl', 'embl'):
...       index = 0
...       for cur_feature in rec.features:
...           index += 1
...           if wanted_ids.intersection(cur_feature.qualifiers.get("db_xref", [])):
...               old_feature = rec.features[index - 1]
...               print "%i %i" % (old_feature.location.start + 1, old_feature.location.end)
4670 5026

Note I'm giving the location of the previous feature with rec.features[index - 1] rather than what your example did which was three features back.

Also note the start + 1 in the print which is to convert from Python counting to normal one based counting as used in EMBL files.

You might also wish to add a filter if cur_feature.type == "CDS" to this which would probably make it faster.

ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by Peter5.9k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1494 users visited in the last hour