Adding Consensus Numbering To An Antibody Amino Acid Sequence In Biopython
13.7 years ago

I would like to add the consensus numbering, as determined by Andrew Martin's Abnum utility, to a Biopython Seq object. After browsing the Biopython Tutorial and Cookbook and the Bio module source, and several hours of Google searching over a few days, I haven't been able to find a clear answer on whether this is possible.

The end goal is to create a Django web app to compare various antibody sequences with their germline counterparts and, eventually, structural information. For this, it will be helpful to have not only the index of the residue in the sequence, but the Kabat (or Chothia, etc.) position as well.

For example, the Abnum output for an antibody light chain with the amino acid sequence SYVLTQPPSVS... looks like this:

L1 S
L2 Y
L3 V
L4 L
L5 T
L6 Q
L7 P
L8 P
L9 S
L10 -
L11 V
L12 S
...

The trouble lies in the gaps (e.g. L10 -) and insertions (e.g. H99, H100, H100A ... H100G, H101). To refer to a residue properly (particularly when using the structural info), I need to be able to use its consensus number.

Ideally, this would be an attribute added to the individual residue of the sequence:

from Bio.Seq import Seq

# create the sequence object
# (Bio.Alphabet was removed in Biopython 1.78, so no alphabet argument is passed)
s = Seq('SYVLTQPPSVS...')

# after reading the Abnum output into an array,
# loop through both the Abnum array and the sequence,
# perhaps something like this, for starters
i = 0
for line in abnum_output:
    arr = line.split(' ')
    if arr[1] != '-':
        # what I would like to do; as written it fails with AttributeError,
        # because s[i] is a plain one-letter str
        s[i].kabat_number = arr[0]
        i += 1

What I haven't been able to find is a way to add such an attribute to a single residue, rather than to the entire chain. I'm hoping someone out there might have a solution. Any help will be greatly appreciated! Thanks.

biopython sequence • 3.7k views
13.7 years ago
Peter 6.0k

First of all, have you considered using a gapped Seq object? Given the output it appears you would be better off working with that than an ungapped sequence.

The Biopython object hierarchy keeps the Seq object simple and stores any annotation in the SeqRecord object instead.

I'd therefore suggest using a SeqRecord object and its letter_annotations attribute, which is essentially a dictionary of per-letter information (e.g. quality scores in sequencing). In this case, store a list of consensus numbers, e.g.

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# create the gapped sequence object
# (Bio.Alphabet was removed in Biopython 1.78; the gap is just a character in the string)
s = Seq("SYVLTQPPS-VS")
# one consensus number per letter of the gapped sequence
con_numbers = ["L1", "L2", "L3", "L4", "L5", "L6", "L7", "L8", "L9", "L10", "L12", "L13"]
r = SeqRecord(s, id="Hello", letter_annotations={"con_num": con_numbers})

# print each letter next to its consensus number
for letter, con in zip(r.seq, r.letter_annotations["con_num"]):
    print(letter, con)

In this case you could also consider using a plain string and list for the gapped sequence and its coordinates.
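A minimal sketch of that plain string and list approach, using only the standard library. The helper name parse_abnum and the by_position dictionary are made up for illustration, and the input is assumed to be one "position residue" pair per line, as in the Abnum output quoted in the question.

```python
# Abnum-style output copied from the question: "<position> <residue>" per line.
abnum_output = """\
L1 S
L2 Y
L3 V
L4 L
L5 T
L6 Q
L7 P
L8 P
L9 S
L10 -
L11 V
L12 S
"""


def parse_abnum(text):
    """Hypothetical helper: return (positions, residues), gaps included."""
    positions, residues = [], []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        positions.append(fields[0])
        residues.append(fields[1])
    return positions, residues


positions, residues = parse_abnum(abnum_output)
gapped = "".join(residues)  # gapped sequence as a plain string

# Map each consensus number to (index in the ungapped sequence, residue).
# Gap positions are skipped, so they never consume an ungapped index.
by_position = {}
index = 0
for pos, res in zip(positions, residues):
    if res != "-":
        by_position[pos] = (index, res)
        index += 1

print(gapped)              # SYVLTQPPS-VS
print(by_position["L11"])  # (9, 'V')
```

The positions list has one entry per letter of the gapped string, so it could also be passed as a letter_annotations value for a SeqRecord built from that string.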


Thanks for the tip, @Peter -- I think that's just what I was after. Much appreciated!

One quick edit: it seems your con_numbers list should have "L11" and "L12" as the last numbers for this particular example, since the letter_annotations are assigned to the gap positions as well.


I was reproducing your example - I take it you wanted something a little different? Well anyway, the same basic idea should work.
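For the insertion positions mentioned in the question (H100, H100A ... H100G), the labels would probably need splitting into parts before they can be ordered or compared, since a plain string sort puts "H99" after "H101". A rough sketch follows; split_label is a made-up helper, and it assumes labels consist of a chain letter, an integer and an optional single insertion letter that sorts alphabetically after the base position. Other numbering schemes may order insertions differently.

```python
import re

# Assumed label format: chain letter, integer, optional insertion letter,
# e.g. "L10" or "H100A".
_LABEL = re.compile(r"^([A-Za-z])(\d+)([A-Z]?)$")


def split_label(label):
    """Hypothetical helper: 'H100A' -> ('H', 100, 'A'); 'L10' -> ('L', 10, '')."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognised position label: {label!r}")
    chain, number, insertion = match.groups()
    return chain, int(number), insertion


labels = ["H101", "H100B", "H99", "H100", "H100A"]
# Sorting on the split tuple compares the number numerically, then the insertion letter.
ordered = sorted(labels, key=split_label)
print(ordered)  # ['H99', 'H100', 'H100A', 'H100B', 'H101']
```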
