Question

Adding Consensus Numbering To An Antibody Amino Acid Sequence In Biopython

13.7 years ago

Jared Sampson ▴ 10

I would like to add the consensus numbering, as determined by Andrew Smith's Abnum utility, to a Biopython Seq object. After browsing the Biopython Tutorial and Cookbook and Bio module source for a while, and several hours of Google searching over a few days, I haven't been able to find a clear answer on whether this is possible.

The end goal is to create a Django web app to compare various antibody sequences with their germline counterparts and, eventually, structural information. For this, it will be helpful to have not only the index of the residue in the sequence, but the Kabat (or Chothia, etc.) position as well.

For example, the Abnum output for an antibody light chain with the amino acid sequence SYVLTQPPSVS... looks like this:

L1 S
L2 Y
L3 V
L4 L
L5 T
L6 Q
L7 P
L8 P
L9 S
L10 -
L11 V
L12 S
...

The trouble lies in the gaps (e.g. L10 -) and insertions (e.g. H99, H100, H100A ... H100G, H101). To refer to a residue unambiguously (particularly when using the structural info), I need to be able to use its consensus number.
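To make the gap and insertion cases concrete, here is a minimal sketch of splitting Abnum-style lines into chain, position number, insertion letter and residue. The helper name parse_abnum_line and the regular expression are assumptions for illustration (they are not part of Abnum or Biopython), and the two H-chain residues in the sample are made up.

```python
import re

# Assumed label layout: chain letter, position number, optional insertion
# letter, e.g. "L10" or "H100A". This pattern is illustrative only.
ABNUM_LABEL = re.compile(r"^([HL])(\d+)([A-Z]?)$")

def parse_abnum_line(line):
    """Split one 'LABEL RESIDUE' line into (chain, number, insertion, residue)."""
    label, residue = line.split()
    match = ABNUM_LABEL.match(label)
    if match is None:
        raise ValueError("unrecognised position label: %r" % label)
    chain, number, insertion = match.groups()
    return chain, int(number), insertion, residue

# The H-chain residues here are invented sample data.
lines = ["L9 S", "L10 -", "L11 V", "H100 Y", "H100A G"]
parsed = [parse_abnum_line(line) for line in lines]
for entry in parsed:
    print(entry)
```

Keeping the number and the insertion letter separate means positions sort correctly (H100 before H100A before H101), and a gap is simply an entry whose residue is '-'.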

Ideally, this would be an attribute added to the individual residue of the sequence:

from Bio.Seq import Seq

# create the sequence object
# (Bio.Alphabet was removed in Biopython 1.78, so no alphabet argument is passed)
s = Seq('SYVLTQPPSVS...')

# after reading the Abnum output into an array,
# loop through both the Abnum array and the sequence,
# perhaps something like this, for starters
i = 0
for line in abnum_output:
    arr = line.split()
    if arr[1] != '-':
        # desired behaviour only: s[i] is a plain str, so this assignment fails
        s[i].kabat_number = arr[0]
        i += 1

What I haven't been able to find is a way to add such an attribute to a single residue, rather than to the entire chain. I'm hoping someone out there might have a solution. Any help will be greatly appreciated! Thanks.

biopython sequence • 3.7k views

updated 13.7 years ago by Peter 6.0k • written 13.7 years ago by Jared Sampson ▴ 10

score 1 · Answer 1 · 2011-10-27

First of all, have you considered using a gapped Seq object? Given the output it appears you would be better off working with that than an ungapped sequence.

The Biopython object hierarchy keeps the Seq object simple and stores any annotation in the SeqRecord object instead.

I'd therefore suggest using a SeqRecord object and its letter_annotations attribute, which is essentially a dictionary of per-letter information (e.g. quality scores in sequencing). In this case, store a list of consensus numbers, one entry per letter of the gapped sequence, e.g.

from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

# create the gapped sequence object
# (Bio.Alphabet was removed in Biopython 1.78; the gap is just a '-' character)
s = Seq('SYVLTQPPS-VS')
# one consensus number per letter, including the gap at L10
con_numbers = ["L1", "L2", "L3", "L4", "L5", "L6", "L7", "L8", "L9", "L10", "L11", "L12"]
r = SeqRecord(s, id="Hello", letter_annotations={"con_num": con_numbers})

# print each consensus number next to its letter
for letter, con in zip(r.seq, r.letter_annotations["con_num"]):
    print(con, letter)
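Once the numbers are stored per letter, a residue can be looked up by its consensus number by building a dictionary over the annotation list. The sketch below assumes a recent Biopython (no Bio.Alphabet); the names index_of, residue_at_l11 and fragment are only illustrative.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(
    Seq("SYVLTQPPS-VS"),
    id="example",
    # L1 .. L12, one label per column of the gapped sequence
    letter_annotations={"con_num": ["L%d" % n for n in range(1, 13)]},
)

# Map each consensus number to its index in the gapped sequence.
index_of = {con: i for i, con in enumerate(record.letter_annotations["con_num"])}

# Indexing a Seq with an integer returns a single-letter string.
residue_at_l11 = record.seq[index_of["L11"]]

# Slicing a SeqRecord slices its per-letter annotations along with the sequence.
fragment = record[8:11]
print(residue_at_l11, str(fragment.seq), fragment.letter_annotations["con_num"])
```

Because the annotation list must have the same length as the sequence, the gap column keeps its own label (L10 here), so indices into the gapped sequence and the label list always line up.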

In this case you could also consider using a plain string and list for the gapped sequence and its coordinates.
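A minimal sketch of that plain string and list approach, using the twelve Abnum lines from the question. The variable names (positions, letters, ungapped_index) are made up for the example, and the mapping to ungapped indices is one possible way to relate consensus numbers back to the original sequence.

```python
# Abnum-style lines, as in the question (first twelve positions only).
abnum_output = ["L1 S", "L2 Y", "L3 V", "L4 L", "L5 T", "L6 Q",
                "L7 P", "L8 P", "L9 S", "L10 -", "L11 V", "L12 S"]

positions = []  # consensus labels, one per column
letters = []    # residue or gap character, one per column
for line in abnum_output:
    label, residue = line.split()
    positions.append(label)
    letters.append(residue)
gapped = "".join(letters)

# Map consensus label -> index in the ungapped sequence (gap columns are skipped).
ungapped_index = {}
i = 0
for label, residue in zip(positions, letters):
    if residue != "-":
        ungapped_index[label] = i
        i += 1
ungapped = gapped.replace("-", "")

print(gapped)
print(ungapped[ungapped_index["L11"]])
```

Gap positions such as L10 are absent from ungapped_index, so looking one up raises a KeyError rather than silently returning a neighbouring residue.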