Indexing A List For Extracting X,Y,Z Coordinates From A PDB File
8.9 years ago
s.charonis ▴ 70

Hello BioStar community,

I'm having a small issue with list indexing. I am extracting certain information from a PDB (Protein Data Bank) file and need certain fields of the file to be copied into a list. The entries look like this:

ATOM 1512 N VAL A 222 8.544 -7.133 25.697 1.00 48.89 N
ATOM 1513 CA VAL A 222 8.251 -6.190 24.619 1.00 48.64 C
ATOM 1514 C VAL A 222 9.528 -5.762 23.898 1.00 48.32 C

I am using the following syntax to parse these lines into a list:

atom_coord = []         # all ATOM records read from the file
charged_res_coord = []  # store x,y,z of extracted charged residues
for line in pdb:
    if line.startswith('ATOM'):
        atom_coord.append(line)

for i in range(len(atom_coord)):
    for item in charged_res:
        if item in atom_coord[i]:
            charged_res_coord.append(atom_coord[i].split()[1:9])


The problem begins with entries such as the following.

ROW1) ATOM 1572 NH2 ARG A 228 7.890 -13.328 16.363 1.00 59.63 N

ROW2) ATOM 1617 N GLU A1005 11.906 -2.722 7.994 1.00 44.02 N

Here, the code I use to extract the third spatial coordinate (the last of the three consecutive non-integer values) breaks. Because 'A1005' (second row) is treated as a single list entry, while 'A' and '228' (first row) are two list entries, indexing element [7] of the list extracts '16.363' (the entry I want) for the first row and '1.00' (not the entry I want) for the second row.

charged_res_coord[1] ['1572', 'NH2', 'ARG', 'A', '228', '7.890', '-13.328', '16.363']

charged_res_coord[10] ['1617', 'N', 'GLU', 'A1005', '11.906', '-2.722', '7.994', '1.00']

The loop I use goes like this:

for i in range(len(lys_charged_group)):
    lys_charged_group[i][7] = float(lys_charged_group[i][7])


The [7] is the problem: in lines like ROW1 the code extracts the correct value, but in lines like ROW2 it extracts the wrong value. Unfortunately, the two row formats are interspersed throughout the PDB file, since both "A1000" and "A 100" style values occur, so I don't know whether I can solve this with text-processing routines. Would I have to use regular expressions?
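
To make the mismatch concrete, here is a minimal reproduction using the two rows quoted above. It is only a sketch: the rows are typed with single spaces, as in the post, and the variable names are made up for illustration. It also shows one stopgap, counting fields from the right, which works for these two rows but assumes every line ends with occupancy, B-factor and element fields and that no other columns are fused.

```python
# ROW1 and ROW2 from the question, single-spaced as they appear in the post.
row1 = "ATOM 1572 NH2 ARG A 228 7.890 -13.328 16.363 1.00 59.63 N"
row2 = "ATOM 1617 N GLU A1005 11.906 -2.722 7.994 1.00 44.02 N"

# Same slice as in the question: fields 1..8 of the whitespace split.
fields1 = row1.split()[1:9]
fields2 = row2.split()[1:9]

print(fields1[7])  # 16.363 (the z coordinate)
print(fields2[7])  # 1.00   (the occupancy: 'A1005' is one token, so
                   #         everything after it sits one position earlier)

# Stopgap: z is the fourth field from the end in both rows.
z_from_end = [row.split()[-4] for row in (row1, row2)]
print(z_from_end)  # ['16.363', '7.994']
```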

python pdb • 6.9k views

As dimkal mentions, using an existing parser is probably the best way to go. However, if you insist on doing this your own way, I would probably use a regex and pull the individual groups out into separate variables.


See my answer for the code :)

8.9 years ago
dimkal ▴ 730

You should really try to use the libraries already available for reading PDB files. Why reinvent the wheel? There are many more subtleties in reading PDB files, e.g. missing atoms, or alternative coordinates for the same atom(s). Have a look at the BioPython libraries. Here is an example of how to do this.
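
As a rough sketch of what that can look like with Biopython's Bio.PDB module (a third-party package, so it has to be installed first): the two records below are ROW1 and ROW2 from the question, re-spaced by hand to the fixed PDB columns, which is an assumption about the original file. A real script would more likely pass a file name to get_structure than an in-memory handle.

```python
import io

from Bio.PDB import PDBParser  # third-party: pip install biopython

# ROW1 and ROW2 from the question, re-spaced by hand to fixed-width columns
# (assumed layout; the post shows them with collapsed whitespace).
PDB_TEXT = (
    "ATOM   1572  NH2 ARG A 228       7.890 -13.328  16.363  1.00 59.63           N\n"
    "ATOM   1617  N   GLU A1005      11.906  -2.722   7.994  1.00 44.02           N\n"
)

CHARGED = {"ARG", "HIS", "LYS", "ASP", "GLU"}

# QUIET=True silences construction warnings for this tiny fragment.
structure = PDBParser(QUIET=True).get_structure("demo", io.StringIO(PDB_TEXT))

coords = []
for atom in structure.get_atoms():
    residue = atom.get_parent()
    if residue.get_resname() in CHARGED:
        x, y, z = atom.coord  # NumPy array of three floats
        # residue.get_id() is (hetero flag, residue number, insertion code)
        coords.append((residue.get_id()[1], atom.get_name(),
                       float(x), float(y), float(z)))

for entry in coords:
    print(entry)
```

The chain ID and residue number come back as separate values here, so the "A1005" case needs no special handling.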

Good luck


@dimkal Many thanks for the link!


There's a good PDF available on the BioPython site too - http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf


Thank you for that! I was worried about using regexes because I'm not familiar with them, but it doesn't seem too difficult the way you wrote the code! Thanks for the link as well!


No problem! It's always best to use existing parsers, though, rather than wrangling the data yourself. There may be occasions when that regex doesn't match a line because of some other ambiguity, which would leave you continually tweaking things. Using ready-made PDB parsers should solve that!
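
If you do go the regex route anyway, it helps to check for lines that fail to match instead of assuming every line does. The sketch below uses a hypothetical pattern written just for this illustration (it is not the regex from the answer), and the third row is invented to show one way a line can slip through: in fixed-width files, a coordinate such as -100.123 fills all eight of its columns, so it touches the next value.

```python
import re

# Hypothetical pattern, written for this sketch only: it tolerates a chain ID
# fused to the residue number ("A1005") but still expects whitespace between
# the three coordinates.
ATOM_RE = re.compile(
    r"^ATOM\s+(\d+)\s+(\S+)\s+([A-Z]{3})\s+([A-Z])\s*(-?\d+)"
    r"\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)"
)

rows = [
    "ATOM 1572 NH2 ARG A 228 7.890 -13.328 16.363 1.00 59.63 N",
    "ATOM 1617 N GLU A1005 11.906 -2.722 7.994 1.00 44.02 N",
    # Invented row: x and y are wide enough to touch each other.
    "ATOM 9999 CA LYS A 300 -100.123-200.456 10.000 1.00 20.00 C",
]

parsed, skipped = [], []
for row in rows:
    match = ATOM_RE.match(row)
    if match is None:
        # Keep the line for inspection rather than crashing on match.group().
        skipped.append(row)
        continue
    parsed.append(match.groups())

print(len(parsed), len(skipped))  # 2 1
```

Collecting the skipped lines makes it obvious when the pattern needs another tweak, which is exactly the maintenance burden an existing parser avoids.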


The Python re (regular expression) module documentation can be found here http://docs.python.org/library/re.html. I also recommend the O'Reilly Regular Expressions books - http://search.oreilly.com/?q=regular+expressions&x=0&y=0 - I have the Pocket Reference one :)

8.9 years ago

Here's the code using a regex:

#!/usr/bin/env python
"""
Simple PDB parser
Coded by Steve Moss (gawbul [at] gmail [dot] com)
"""

# import regex module
import re

# set fake pdb rows
pdb = ["ATOM 1512 N VAL A 222 8.544 -7.133 25.697 1.00 48.89 N", "ATOM 1513 CA VAL A 222 8.251 -6.190 24.619 1.00 48.64 C", "ATOM 1514 C VAL A 222 9.528 -5.762 23.898 1.00 48.32 C", "ATOM 1572 NH2 ARG A 228 7.890 -13.328 16.363 1.00 59.63 N", "ATOM 1617 N GLU A1005 11.906 -2.722 7.994 1.00 44.02 N"]

# create some variables
charged_res_coords = [] # store x,y,z of extracted charged residues
charged_res = ["ARG", "HIS", "LYS", "ASP", "GLU"]

# compile regex
regex = re.compile(r"^ATOM\s([0-9]{4})\s([A-Z0-9]+)\s([A-Z]{3})\s([A-Z]{1}\s{0,1}[0-9]+)\s([0-9\.-]+)\s([0-9\.-]+)\s([0-9\.-]+)\s([0-9\.-]+)\s([0-9\.-]+)\s([A-Z]{1})$")

# iterate over lines in pdb
for line in pdb:
    # check if starts with ATOM
    if line.startswith('ATOM'):
        # iterate over charged residues
        for res in charged_res:
            # is residue in line?
            if res in line:
                # check for match to regex
                match = regex.match(line)
                # skip lines the regex cannot handle
                if match is None:
                    continue
                # pull out third spatial (z-coordinate)
                third_spatial = match.group(7)
                print(third_spatial)
                # append groups to list
                charged_res_coords.append(match.groups()[0:-3])

# print the stuff we added
print(charged_res_coords)

Which gives:

C:\Users\Steve\Dropbox\Code>python simple_pdb_parser.py
16.363
7.994
[('1572', 'NH2', 'ARG', 'A 228', '7.890', '-13.328', '16.363'), ('1617', 'N', 'GLU', 'A1005', '11.906', '-2.722', '7.994')]

8.9 years ago

Instead of string splitting and regular expression matching, I would recommend string slicing. It should be more robust, and it strictly follows the PDB format. Adapted from Steve's script, this would be:

#!/usr/bin/env python
"""
Simple PDB parser
Coded by Steve Moss (gawbul [at] gmail [dot] com)
http://about.me/gawbul
Slicing features by Pierre Poulain
http://cupnet.net/
"""

# excerpts from PDB 3UNC
# http://www.rcsb.org/pdb/explore/explore.do?structureId=3UNC
# slicing relies on the fixed-width PDB columns, so the spacing matters
pdb = [
"ATOM   6455  CA  HIS A 863      67.077 124.238  39.765  1.00 21.04           C  ",
"ATOM   6588  CA  ARG A 880      71.845 113.877  36.770  1.00 16.38           C  ",
"ATOM   6830  CA  PHE A 911      61.272 109.012  33.241  1.00 15.29           C  ",
"ATOM   8183  CA  ASP A1084      76.538  97.355  28.275  1.00 17.61           C  ",
"ATOM   8320  CA  LEU A1101      84.288  72.813  25.891  1.00 22.39           C  ",
"ATOM   8429  CA  GLU A1114      76.247  69.337  19.532  1.00 24.21           C  "
]

charged_res_coords = [] # store x,y,z of extracted charged residues
charged_res = ["ARG", "HIS", "LYS", "ASP", "GLU"]

# iterate over lines in pdb
for line in pdb:
    # check if line starts with "ATOM"
    if line.startswith('ATOM'):
        # define fields of interest
        atom_id = line[6:11].strip()
        atom_name = line[12:16].strip()
        res_name = line[17:20].strip()
        chain_name = line[21:22].strip()
        residue_id = line[22:26].strip()
        x = line[30:38].strip()
        y = line[38:46].strip()
        z = line[46:54].strip()
        if res_name in charged_res:
            # pull out third spatial (z-coordinate)
            print(z)
            # append groups to list
            charged_res_coords.append([atom_id, atom_name, res_name, chain_name, residue_id, x, y, z])

# print the stuff we added
print(charged_res_coords)

which gives:

[poulain@cumin ~]$ python parser.py
39.765
36.770
28.275
19.532
[['6455', 'CA', 'HIS', 'A', '863', '67.077', '124.238', '39.765'], ['6588', 'CA', 'ARG', 'A', '880', '71.845', '113.877', '36.770'], ['8183', 'CA', 'ASP', 'A', '1084', '76.538', '97.355', '28.275'], ['8429', 'CA', 'GLU', 'A', '1114', '76.247', '69.337', '19.532']]


As a quick reminder, I have documented the PDB ATOM line format from a Python perspective here: http://cupnet.net/pdb-file-atom-line-memo/
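
Since the original question ends with a float() conversion, here is a small sketch that wraps the same column slices in a helper returning typed values. The names Atom and parse_atom_line are made up for this illustration, and the example line is one of the 3UNC excerpts written out with fixed-width spacing.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    """Selected fields of one PDB ATOM record (hypothetical helper type)."""
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Slice a fixed-width ATOM line; indices are 0-based, end-exclusive."""
    return Atom(
        serial=int(line[6:11]),       # columns 7-11
        name=line[12:16].strip(),     # columns 13-16
        res_name=line[17:20].strip(), # columns 18-20
        chain=line[21],               # column 22
        res_seq=int(line[22:26]),     # columns 23-26
        x=float(line[30:38]),         # columns 31-38
        y=float(line[38:46]),         # columns 39-46
        z=float(line[46:54]),         # columns 47-54
    )


line = "ATOM   8183  CA  ASP A1084      76.538  97.355  28.275  1.00 17.61           C"
atom = parse_atom_line(line)
print(atom.chain, atom.res_seq, atom.z)  # A 1084 28.275
```

Because the chain ID and residue number are read from separate columns, "A1084" and "A 863" are handled identically, and int() and float() tolerate the padding spaces inside each slice.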

As dimkal mentioned, you should also try the PDB parser implemented in BioPython.