Question: Biopython: work with loops and qualifiers from multiple records
1
gravatar for dago
3.0 years ago by
dago2.6k
Germany
dago2.6k wrote:

I have a gbk of a draft bacterial genomes (meaning multiple contigs). I want to extract some info about specific genes, and I get an error I cannot understand.

Parse the gbk

record_dict = SeqIO.to_dict(SeqIO.parse("SpeciesA.gbk", "genbank"))

My file looks:

    for i in record_dict:
    ...     for f in record_dict[i].features:
    ...             print f.qualifiers
{'locus_tag': ['AA_03640'], 'gene': ['ftsH_4']}
{'locus_tag': ['AA_03640'], 'inference': ['ab initio prediction:Prodigal:2.6', 'similar to AA sequence:UniProtKB:P37476'], 'codon_start': ['1'], 'EC_number': ['3.4.24.-'], 'transl_table': ['11'], 'product': ['ATP-dependent zinc metalloprotease FtsH'], ........

Now if I try to get out the contig name where the gene AA_03640 is in, as follow:

for i in record_dict:
...     for f in record_dict[i].features:
...             if f.qualifiers['locus_tag'] == 'AA_03640':
                    print(i)

But I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
KeyError: 'locus_tag'

Any help to understand where I am wrong? Thanks

ADD COMMENTlink modified 3.0 years ago by Peter5.8k • written 3.0 years ago by dago2.6k
1

The error is quite self-explanatory I would say, and means that f.qualifiers doesn't have a key "locus_tag"...

You can check which keys it does have using f.qualifiers.keys()

ADD REPLYlink written 3.0 years ago by WouterDeCoster44k
1

I thought the same, but when I check the keys locus_tag is in there:

['locus_tag', 'gene']
['locus_tag', 'inference', 'codon_start', 'EC_number', 'transl_table', 'product', 'translation', 'gene', 'protein_id']
ADD REPLYlink written 3.0 years ago by dago2.6k
1

Odd. For a moment I thought this was an easy question ;-)

Maybe use a try-except block:

try:
    if f.qualifiers['locus_tag'] == 'AA_03640':
        #stuff
except KeyError:
    print(f.qualifiers)

An unrelated problem is that you should check for f.qualifiers['locus_tag'] == ['AA_03640'], since it's a list and not a character.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by WouterDeCoster44k

ok, I am officially confused. If I run:

for i in record_dict:
    for f in record_dict[i].features:
        print f.qualifiers.keys()

I get

['locus_tag', 'inference', 'codon_start', 'product', 'transl_table', 'translation', 'protein_id']
['locus_tag']
['locus_tag', 'inference', 'codon_start', 'product', 'transl_table', 'translation', 'protein_id']
['locus_tag']

But If I run

for i in record_dict:
    for f in record_dict[i].features:
        try:
            if f.qualifiers['locus_tag'] == ['AA_03640']:
                print i
        except KeyError:
            print(f.qualifiers.keys())

I get

['estimated_length', 'linkage_evidence', 'gap_type']
['strain', 'mol_type', 'organism']
['strain', 'mol_type', 'organism']

I should have the same keys, shouldn't I?

ADD REPLYlink written 3.0 years ago by dago2.6k
2

To answer your question, no, you should not assume all the features have the same keys. It varies dramatically by the feature type - the source feature (usually there is only one) has things like the organism, while while CDS and gene features have things like a locus tag, and the CDS often has a translation. You probably want to add if f.type == "CDS": or if f.type == "gene": to your loop. See also http://www.warwick.ac.uk/go/peter_cock/python/genbank which does something similar to what I think you want to achieve.

(I have expanded on this as a full answer)

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Peter5.8k

I am officially confused

I join being confused.

I don't really have experience with genbank files so I can't really nail this down. Maybe something for Peter, although I'm not sure if he visits biostars often...

ADD REPLYlink written 3.0 years ago by WouterDeCoster44k

Double check your input genbank isn't malformed and that its /locus_tag fields are correct?

If you hit Peter up on twitter he's been really helpful to me in the past (and hes always on twitter ;) )

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Joe17k

I checked the gbk, it looks fine. I will try twitter..:)

ADD REPLYlink written 3.0 years ago by dago2.6k
2

Someone kindling pinged me on Twitter

ADD REPLYlink written 3.0 years ago by Peter5.8k

They were faster then me!=)

ADD REPLYlink written 3.0 years ago by dago2.6k
2
gravatar for Peter
3.0 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

Not all feature types will have a locus tag - you probably only care about CDS or gene features. Even then, not genes will have a locus tag (although generally within a record they either all do, or all don't).

Finally if there is a locus tag, then f.qualifiers['locus_tag'] will return a list (since in general the annotation qualifiers can appear more than once), so you'd want to check using 'AA_03640' in f.qualifiers['locus_tag'] or similar, e.g.

for i in record_dict:
    for f in record_dict[i].features:
        if f.type == "CDS":
            if 'AA_03640' in f.qualifiers.get('locus_tag', []):
                print(i)

See also http://www.warwick.ac.uk/go/peter_cock/python/genbank which does something similar to what I think you want to achieve.

ADD COMMENTlink written 3.0 years ago by Peter5.8k

Hi, thank you very much for the answer! You are right! Adding if f.type == "CDS" fixed it!

ADD REPLYlink written 3.0 years ago by dago2.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 949 users visited in the last hour