Question: biopython.PDB residue.get_id() does not match sequence index
gravatar for stefan-3000
13 months ago by
stefan-30000 wrote:

Hi all,

I would like to retrieve the sequence index position for an amino acid residue from a .cif file using Biopython's PDB package. My goal is to get the sequence index of spatially neighboring AAs using Neighborsearch I have the following piece of code.


import Biopython.PDB as bpdb

structure = bpdb.MMCIFParser().get_structure('subt', 'file.cif')

for chain in structure[0]:
    for res in chain:
        for atom in res:
neighbor = bpdb.NeighborSearch(atoms)

coords = []
for idx, residue in enumerate(structure[0].get_residues()):
    if bpdb.is_aa(residue):
        name = residue.get_resname()
        ca_atom = residue['CA']
        coords.append((idx, residue, ca_atom.get_coord()))
        occupancy = ca_atom.get_occupancy()
        if 33 < idx < 40:
            print(f'AA:{name}, idx:{idx}, get_id:{residue.get_id()[1]}')


AA:ILE, idx:34, get_id:35
AA:SER, idx:35, get_id:36
AA:THR, idx:36, get_id:38
AA:HIS, idx:37, get_id:39
AA:PRO, idx:38, get_id:40
AA:ASP, idx:39, get_id:41

Now as you can see, the sequential index idx gives the real primary structure position of the AA, but the index, retrieved with the residue.get_id() function, sometimes jumps (as can be seen at idx:37) and therefore seems to indicate the wrong position. The problem becomes clear when searching for spatial neighbors, since I would like to find the (correct) sequence index of those neighbors:


radius = 5
coord = coords[36][2]
neighbors =, radius, level='R')

for item in neighbors:


<Residue SER het=  resseq=36 icode= >
<Residue HOH het=W resseq=1135 icode= >
<Residue THR het=  resseq=38 icode= >

But as you can see is is not possible to trace back the position of THR (get_id:38 against idx:36) Why does it jump and how can I get the correct get_id()?

Hope someone can help me out,


ADD COMMENTlink modified 13 months ago by fishgolden450 • written 13 months ago by stefan-30000
gravatar for stefan-3000
13 months ago by
stefan-30000 wrote:

I managed to find a workaround which gives me the idx for the neighboring AAs:

radius = 5
coord = coords[36][2]
neighbors =, radius, level='R')

sequence_index = []
for item in neighbors:
    for i in range(len(coords)):
        if item == coords[i][1]:

However this still does not answer my question why the get_id() skips a position in some cases, I have a feeling it has something to do with the water molecules (Residue HOH), but cannot find the arguments for that statement...

ADD COMMENTlink written 13 months ago by stefan-30000
gravatar for fishgolden
13 months ago by
fishgolden450 wrote:

I'm not using biopython but as I thought this is one of the general pit fall of handling PDB structure, I give an answer. (please give a comment if it isn't)

Why idx and id are different is because "id" means not "index" but "identifier".

Residues in PDB have several ids. There are two such ids at least.



Looking at your print result of the elements, "resseq" must be resSeq written in here

Thus they are the numbers given by the author (must be auth_seq_id), I guess.

Why it jumps is, most of the cases, because there are "missing residues (residues do not show in structure. thus do not have coordinates)".

Or sometimes, experiments were performed without those residues.

ADD COMMENTlink modified 13 months ago • written 13 months ago by fishgolden450
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 953 users visited in the last hour