Question

Indexing A List For Extracting X,Y,Z Coordinates From A Pdb File

1

11.9 years ago

s.charonis ▴ 100

Hello BioStar community,

I'm having a small issue with list indexing. I am extracting information from a PDB (Protein Data Bank) file and need certain fields of each line copied into a list. The entries look like this:

ATOM 1512 N VAL A 222 8.544 -7.133 25.697 1.00 48.89 N
ATOM 1513 CA VAL A 222 8.251 -6.190 24.619 1.00 48.64 C
ATOM 1514 C VAL A 222 9.528 -5.762 23.898 1.00 48.32 C

I am using the following syntax to parse these lines into a list:

atom_coord = []         # all ATOM lines from the PDB file
charged_res_coord = []  # store x,y,z of extracted charged residues

for line in pdb:
    if line.startswith('ATOM'):
        atom_coord.append(line)

for i in range(len(atom_coord)):
    for item in charged_res:
        if item in atom_coord[i]:
            charged_res_coord.append(atom_coord[i].split()[1:9])

The problem begins with entries such as the following.

ROW1) ATOM 1572 NH2 ARG A 228 7.890 -13.328 16.363 1.00 59.63 N

ROW2) ATOM 1617 N GLU A1005 11.906 -2.722 7.994 1.00 44.02 N

Here, the code that I use to extract the third spatial coordinate (the last of the three consecutive non-integer values) runs into a problem. Because 'A1005' (second row) is treated as a single list entry, while 'A' and '228' (first row) are two list entries, indexing element [7] in a loop extracts '16.363' (the entry I want) for the first row and '1.00' (not the entry I want) for the second row.

charged_res_coord[1] ['1572', 'NH2', 'ARG', 'A', '228', '7.890', '-13.328', '16.363']

charged_res_coord[10] ['1617', 'N', 'GLU', 'A1005', '11.906', '-2.722', '7.994', '1.00']
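For illustration, here is a minimal sketch that reproduces the mismatch using the two rows quoted above. The names `row1`, `row2`, `fields1` and `fields2` are placeholders chosen for this example and are not part of the original script.

```python
# Two ATOM records from the question. In row2 the 4-digit residue number
# runs into the chain identifier ("A1005"), so there is no space to split on.
row1 = "ATOM 1572 NH2 ARG A 228 7.890 -13.328 16.363 1.00 59.63 N"
row2 = "ATOM 1617 N GLU A1005 11.906 -2.722 7.994 1.00 44.02 N"

fields1 = row1.split()[1:9]
fields2 = row2.split()[1:9]

# Whitespace splitting yields one token fewer for row2, so every field
# after the chain/residue column shifts one position to the left.
print(len(row1.split()), len(row2.split()))

# The same index therefore points at different columns in the two rows.
print(fields1[7], fields2[7])
```

The token counts differ by one, which is why a fixed index such as [7] cannot be correct for both row styles.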

The loop I use goes like this:

for i in range(len(lys_charged_group)):
    lys_charged_group[i][7] = float(lys_charged_group[i][7])

The [7] is the problem: in lines like ROW1 the code extracts the correct value, but in lines like ROW2 it extracts the wrong one. Unfortunately, the two row formats are interspersed throughout the PDB file, since both "A1000"-style and "A 100"-style values occur, so I don't know whether I can solve this using text-processing routines. Would I have to use regular expressions?

Many thanks for your help!

python pdb • 8.8k views

ADD COMMENT • link updated 11.9 years ago by Pierre Poulain ▴ 10 • written 11.9 years ago by s.charonis ▴ 100

0

As dimkal mentions, using an existing parser is probably the best way to go. However, if you insist on doing this your own way, I would use a regex and pull out the individual groups into separate variables.

ADD REPLY • link 11.9 years ago by Steve Moss 2.3k

0

See my answer for the code :)

ADD REPLY • link 11.9 years ago by Steve Moss 2.3k

score 5 · Answer 1 · 2012-05-08

5

11.9 years ago

dimkal ▴ 730

You should really try to use existing libraries to read PDB files. Why reinvent the wheel? There are many more subtleties in reading PDB files, e.g. missing atoms, or even alternative coordinates for the same atom(s). Have a look at the BioPython libraries. Here is an example of how to do this.

Good luck

ADD COMMENT • link 11.9 years ago by dimkal ▴ 730

0

@dimkal Many thanks for the link!

ADD REPLY • link 11.9 years ago by s.charonis ▴ 100

0

There's a good PDF available on the BioPython site too - http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf

ADD REPLY • link 11.9 years ago by Steve Moss 2.3k

0

Thank you for that! I was worried about using regexes because I'm not familiar with them, but it doesn't seem too difficult the way you wrote the code! Thanks for the link as well!

ADD REPLY • link 11.9 years ago by s.charonis ▴ 100

0

No problem! It's always best to use existing parsers though, rather than wrangling the data yourself. There may be occasions when that regex doesn't match the line, if there are some other ambiguities, which would result in you continually tweaking things. Using ready made PDB parsers should solve that!
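As a rough sketch of that caveat: `re.match` returns `None` when a line does not fit the pattern, so it is worth checking for that before calling `.group()`. The pattern `ATOM_RE` and the helper `parse_atom` below are hypothetical, written for this illustration, and are more tolerant than (and not the same as) the regex in the answer.

```python
import re

# Hypothetical pattern: one or more whitespace characters between fields,
# and optional whitespace between the chain ID and the residue number.
ATOM_RE = re.compile(
    r"^ATOM\s+(\d+)\s+(\S+)\s+([A-Z]{3})\s+([A-Z])\s*(-?\d+)"
    r"\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)"
)

def parse_atom(line):
    """Return (res_name, chain, res_number, x, y, z), or None if no match."""
    match = ATOM_RE.match(line)
    if match is None:
        return None  # the caller decides whether to skip or report the line
    _serial, _name, res, chain, resnum, x, y, z = match.groups()
    return res, chain, int(resnum), float(x), float(y), float(z)

print(parse_atom("ATOM 1617 N GLU A1005 11.906 -2.722 7.994 1.00 44.02 N"))
print(parse_atom("ATOM 1572 NH2 ARG A 228 7.890 -13.328 16.363 1.00 59.63 N"))
print(parse_atom("HETATM 9999 O HOH A 301 1.000 2.000 3.000 1.00 20.00 O"))
```

Returning `None` instead of raising keeps the decision with the caller, which makes it easier to log and count the lines that needed special handling.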

ADD REPLY • link 11.9 years ago by Steve Moss 2.3k

0

The Python re (regular expression) module documentation can be found here http://docs.python.org/library/re.html. I also recommend the O'Reilly Regular Expressions books - http://search.oreilly.com/?q=regular+expressions&x=0&y=0 - I have the Pocket Reference one :)

ADD REPLY • link 11.9 years ago by Steve Moss 2.3k

score 2 · Answer 2 · 2012-05-08

Here's the code using a regex:

#!/usr/bin/env python
"""
Simple PDB parser
Coded by Steve Moss (gawbul [at] gmail [dot] com)
http://about.me/gawbul
"""

# import regex module
import re

# set fake pdb rows
pdb = ["ATOM 1512 N VAL A 222 8.544 -7.133 25.697 1.00 48.89 N", "ATOM 1513 CA VAL A 222 8.251 -6.190 24.619 1.00 48.64 C", "ATOM 1514 C VAL A 222 9.528 -5.762 23.898 1.00 48.32 C", "ATOM 1572 NH2 ARG A 228 7.890 -13.328 16.363 1.00 59.63 N", "ATOM 1617 N GLU A1005 11.906 -2.722 7.994 1.00 44.02 N"]

# create some variables
charged_res_coords = [] # store x,y,z of extracted charged residues
charged_res = ["ARG", "HIS", "LYS", "ASP", "GLU"]

# compile regex (raw string so the backslashes reach the regex engine unchanged)
regex = re.compile(r"^ATOM\s([0-9]{4})\s([A-Z0-9]+)\s([A-Z]{3})\s([A-Z]{1}\s{0,1}[0-9]+)\s([0-9\.-]+)\s([0-9\.-]+)\s([0-9\.-]+)\s([0-9\.-]+)\s([0-9\.-]+)\s([A-Z]{1})$")

# iterate over lines in pdb
for line in pdb:
    # check if starts with ATOM
    if line.startswith('ATOM'):
        # iterate over charged residues
        for res in charged_res:
            # is residue in line?
            if res in line:
                # check for match to regex
                match = regex.match(line)
                # skip lines the regex does not match
                if match is None:
                    continue
                # pull out third spatial (z-coordinate)
                third_spatial = match.group(7)
                print(third_spatial)
                # append groups to list
                charged_res_coords.append(match.groups()[0:-3])

# print the stuff we added
print(charged_res_coords)

Which gives:

C:\Users\Steve\Dropbox\Code>python simple_pdb_parser.py
16.363
7.994
[('1572', 'NH2', 'ARG', 'A 228', '7.890', '-13.328', '16.363'), ('1617', 'N', 'GLU', 'A1005', '11.906', '-2.722', '7.994')]

score 1 · Answer 3 · 2012-05-14

Instead of string splitting or regular expression matching, I would recommend string slicing. It is more robust because the PDB format is column-based: each field occupies fixed character positions, so slicing follows the format strictly. Adapted from Steve's script, this becomes:

#! /usr/bin/env python

"""
Simple PDB parser
Coded by Steve Moss (gawbul [at] gmail [dot] com)
http://about.me/gawbul

Slicing features by Pierre Poulain 
http://cupnet.net/
"""

# excerpts from PDB 3UNC
# http://www.rcsb.org/pdb/explore/explore.do?structureId=3UNC
pdb = [
"ATOM   6455  CA  HIS A 863      67.077 124.238  39.765  1.00 21.04           C  ",
"ATOM   6588  CA  ARG A 880      71.845 113.877  36.770  1.00 16.38           C  ",
"ATOM   6830  CA  PHE A 911      61.272 109.012  33.241  1.00 15.29           C  ",
"ATOM   8183  CA  ASP A1084      76.538  97.355  28.275  1.00 17.61           C  ",
"ATOM   8320  CA  LEU A1101      84.288  72.813  25.891  1.00 22.39           C  ",
"ATOM   8429  CA  GLU A1114      76.247  69.337  19.532  1.00 24.21           C  "
]

charged_res_coords = [] # store x,y,z of extracted charged residues
charged_res = ["ARG", "HIS", "LYS", "ASP", "GLU"]

# iterate over lines in pdb
for line in pdb:
    # check if line starts with "ATOM"
    if line.startswith('ATOM'):
        # define fields of interest
        atom_id = line[6:11].strip()
        atom_name = line[12:16].strip()
        res_name = line[17:20].strip()
        chain_name = line[21:22].strip()
        residue_id = line[22:26].strip()
        x = line[30:38].strip()
        y = line[38:46].strip()
        z = line[46:54].strip()
        if res_name in charged_res:
            # pull out third spatial (z-coordinate)
            print(z)
            # append groups to list
            charged_res_coords.append([atom_id, atom_name, res_name, chain_name, residue_id, x, y, z])

# print the stuff we added
print(charged_res_coords)

which gives:

[poulain@cumin ~]$ python parser.py 
39.765
36.770
28.275
19.532
[['6455', 'CA', 'HIS', 'A', '863', '67.077', '124.238', '39.765'], ['6588', 'CA', 'ARG', 'A', '880', '71.845', '113.877', '36.770'], ['8183', 'CA', 'ASP', 'A', '1084', '76.538', '97.355', '28.275'], ['8429', 'CA', 'GLU', 'A', '1114', '76.247', '69.337', '19.532']]

As a quick reminder, I have documented the PDB ATOM line format from a Python perspective here: http://cupnet.net/pdb-file-atom-line-memo/
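Since the original question converts the coordinate to a float, here is a small sketch of the same column slicing that returns numbers directly. The helper name `atom_xyz` is made up for this example, and it assumes the line keeps its original fixed-width spacing (it will not work on lines whose whitespace has been collapsed).

```python
def atom_xyz(line):
    """Return (x, y, z) as floats from a fixed-width PDB ATOM line."""
    # Columns 31-38, 39-46 and 47-54 (1-based) hold x, y and z;
    # float() ignores the padding spaces inside each slice.
    return float(line[30:38]), float(line[38:46]), float(line[46:54])

# Record taken from the 3UNC excerpt above; the residue number 1084
# touches the chain ID, which is the case that broke split().
line = "ATOM   8183  CA  ASP A1084      76.538  97.355  28.275  1.00 17.61           C  "
x, y, z = atom_xyz(line)
print(x, y, z)
```

Because the slices depend only on character positions, the result is the same whether or not the chain ID and residue number are separated by a space.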

As mentioned by dimkal, you should also try the PDB parser implemented in BioPython.