Question

how to extract HETATM structures from pdb with python

0

Entering edit mode

4.7 years ago

woojoy14 ▴ 10

I made {sample}.csv consisted of chain name, residue name and residue id extracted from original pdb files.

>  {sample}.csv
>     >{sample}, A, NDP, 351

But it includes HETATM so I would like to make new files of pdb format which only contain HETATM information. I used code below and got the error saying string is not defined.

import os
import sys

input = [f for f in os.listdir('./') if f.endswith('.pdb')]

pdb = input

input2 = [c for c in os.listdir('./') if c.endswith('csv')]

csv = input2

HETATM = []

# iterate over lines in pdb
for file in pdb:
    f_open = open('./'+file,'r')
    for file in csv:
        c_open = open('./'+file, 'r')

        for line in f_open:
            if line.startswith('HETATM'):
                res_name = line[17:20].replace(' ','')
                chain_name = line[21:22].replace(' ','')
                        residue_id = line[22:26].replace(' ','')

                        string = chain_name + ',' + res_name + ',' + residue_id + '\n'


            for line2 in c_open:
                if string in c_open:
                    HETATM.append(string)

I am totally a beginner just started to learn python so don't have an idea how to approach. I would really appreciate your help!

pdb • 3.7k views

ADD COMMENT • link updated 4.7 years ago by fishgolden ▴ 510 • written 4.7 years ago by woojoy14 ▴ 10

0

Entering edit mode

Can you post the exact error from your console please? We don't know where your error is arising.

ADD REPLY • link 4.7 years ago by Joe 21k

0

Entering edit mode

4.7 years ago

Manvitha Reddy ▴ 10

"I would like to make new files of pdb format which only contain HETATM information"

For this, you actually don't need to use Biopython as such. You can achieve this by just using file handling.

Code will be as follows:

> with open("input_filename.pdb", "r") as inputFile,open("output_filename.pdb","w") as outFile:
>     for every line in inputFile:
>         if line.startswith("HETATM"):
>             outFile.write(line)

By using this code you can directly extract the HETAM into the output file. When you need all atoms except the HETAM's the just use "if not line.startswith("HETAM"):" This will extract all the atoms information that are not HETAM's

ADD COMMENT • link 4.7 years ago by Manvitha Reddy ▴ 10

0

Entering edit mode

Are you sure the output file will also be in pdb format (like OP has specified)? Your code will break format as plain Python doesn't know what PDB format means.

ADD REPLY • link 4.7 years ago by Ram 43k

0

Entering edit mode

Yes, what you say is absolutely true, but if the extension used is .pdb then the output file will remain in the PDB format itself and you can even load it into visualization tools to test. I usually use this method to work with the PDB files.

ADD REPLY • link 4.7 years ago by Manvitha Reddy ▴ 10

0

Entering edit mode

Yeah it shouldn’t be a problem. PDBs are just plaintext files like most bioinformatics file formats. Many of the headers etc are optional.

It’s generally a good idea to use a dedicated parser though just as a rule of thumb.

ADD REPLY • link 4.7 years ago by Joe 21k

0

Entering edit mode

I think HETAM must be HETATM.

ADD REPLY • link 4.7 years ago by fishgolden ▴ 510

0

Entering edit mode

It helps a lot, thank you!

ADD REPLY • link 4.7 years ago by woojoy14 ▴ 10

score 1 · Accepted Answer · 2019-08-20

How did you make {sample}.csv ? Your python code has many problems.

I did my best to make your code runnable.

Save following 3 lines as "1hdk.csv".

1hdk,A,LEU,140
1hdk,H,PMB,929
1hdk,A,HOH,2067

and download 1hdk.pdb from https://www.rcsb.org/structure/1HDK

Put those two files and following python script in the same folder & run the script.

import os
import sys
import re

input = [f for f in os.listdir('./') if f.endswith('.pdb')]

pdb = input

# iterate over lines in pdb
for pdbfile in pdb:
    HETATM = {}
    csvfile = re.sub("\.pdb$",".csv",pdbfile)
    f_open = open('./'+pdbfile,'r')
    for line in f_open:
        if line.startswith('HETATM'):
            res_name = line[17:20].replace(' ','')
            chain_name = line[21:22].replace(' ','')
            residue_id = line[22:26].replace(' ','')
            string = chain_name + ',' + res_name + ',' + residue_id
            HETATM[string] = 1

    f_open.close()

    c_open = open('./'+csvfile, 'r')
    print(csvfile+"===")
    for line2 in c_open:
        label = re.sub("^[^,]+,","",re.sub("[\s]","",line2)) #remove {sample}, and whitespace in csv
        if label in HETATM:#check whether the contents in csv is found as HETATM
            print(line2,end="")
    c_open.close()

Then

1hdk.csv===
1hdk,H,PMB,929
1hdk,A,HOH,2067

(The lines you wanted, I think) will be shown.