Question: how to extract HETATM structures from pdb with python
0
gravatar for woojoy14
12 weeks ago by
woojoy140
woojoy140 wrote:

I made {sample}.csv consisted of chain name, residue name and residue id extracted from original pdb files.

>  {sample}.csv
>     >{sample}, A, NDP, 351

But it includes HETATM so I would like to make new files of pdb format which only contain HETATM information. I used code below and got the error saying string is not defined.

import os
import sys

input = [f for f in os.listdir('./') if f.endswith('.pdb')]

pdb = input

input2 = [c for c in os.listdir('./') if c.endswith('csv')]

csv = input2

HETATM = []

# iterate over lines in pdb
for file in pdb:
    f_open = open('./'+file,'r')
    for file in csv:
        c_open = open('./'+file, 'r')

        for line in f_open:
            if line.startswith('HETATM'):
                res_name = line[17:20].replace(' ','')
                chain_name = line[21:22].replace(' ','')
                        residue_id = line[22:26].replace(' ','')

                        string = chain_name + ',' + res_name + ',' + residue_id + '\n'


            for line2 in c_open:
                if string in c_open:
                    HETATM.append(string)

I am totally a beginner just started to learn python so don't have an idea how to approach. I would really appreciate your help!

pdb • 324 views
ADD COMMENTlink modified 12 weeks ago by fishgolden420 • written 12 weeks ago by woojoy140

Can you post the exact error from your console please? We don't know where your error is arising.

ADD REPLYlink written 12 weeks ago by Joe14k
1
gravatar for fishgolden
12 weeks ago by
fishgolden420
fishgolden420 wrote:

How did you make {sample}.csv ? Your python code has many problems.

I did my best to make your code runnable.

Save following 3 lines as "1hdk.csv".

1hdk,A,LEU,140
1hdk,H,PMB,929
1hdk,A,HOH,2067

and download 1hdk.pdb from https://www.rcsb.org/structure/1HDK

Put those two files and following python script in the same folder & run the script.

import os
import sys
import re

input = [f for f in os.listdir('./') if f.endswith('.pdb')]

pdb = input

# iterate over lines in pdb
for pdbfile in pdb:
    HETATM = {}
    csvfile = re.sub("\.pdb$",".csv",pdbfile)
    f_open = open('./'+pdbfile,'r')
    for line in f_open:
        if line.startswith('HETATM'):
            res_name = line[17:20].replace(' ','')
            chain_name = line[21:22].replace(' ','')
            residue_id = line[22:26].replace(' ','')
            string = chain_name + ',' + res_name + ',' + residue_id
            HETATM[string] = 1

    f_open.close()

    c_open = open('./'+csvfile, 'r')
    print(csvfile+"===")
    for line2 in c_open:
        label = re.sub("^[^,]+,","",re.sub("[\s]","",line2)) #remove {sample}, and whitespace in csv
        if label in HETATM:#check whether the contents in csv is found as HETATM
            print(line2,end="")
    c_open.close()

Then

1hdk.csv===
1hdk,H,PMB,929
1hdk,A,HOH,2067

(The lines you wanted, I think) will be shown.

ADD COMMENTlink modified 12 weeks ago • written 12 weeks ago by fishgolden420

Thank you very much for your help!

ADD REPLYlink written 12 weeks ago by woojoy140
0
gravatar for Manvitha Reddy
12 weeks ago by
Delhi, India
Manvitha Reddy10 wrote:

"I would like to make new files of pdb format which only contain HETATM information"

For this, you actually don't need to use Biopython as such. You can achieve this by just using file handling.

Code will be as follows:

> with open("input_filename.pdb", "r") as inputFile,open("output_filename.pdb","w") as outFile:
>     for every line in inputFile:
>         if line.startswith("HETATM"):
>             outFile.write(line)

By using this code you can directly extract the HETAM into the output file. When you need all atoms except the HETAM's the just use "if not line.startswith("HETAM"):" This will extract all the atoms information that are not HETAM's

ADD COMMENTlink modified 12 weeks ago • written 12 weeks ago by Manvitha Reddy10

Are you sure the output file will also be in pdb format (like OP has specified)? Your code will break format as plain Python doesn't know what PDB format means.

ADD REPLYlink written 12 weeks ago by RamRS24k

Yes, what you say is absolutely true, but if the extension used is .pdb then the output file will remain in the PDB format itself and you can even load it into visualization tools to test. I usually use this method to work with the PDB files.

ADD REPLYlink written 12 weeks ago by Manvitha Reddy10

Yeah it shouldn’t be a problem. PDBs are just plaintext files like most bioinformatics file formats. Many of the headers etc are optional.

It’s generally a good idea to use a dedicated parser though just as a rule of thumb.

ADD REPLYlink written 12 weeks ago by Joe14k

I think HETAM must be HETATM.

ADD REPLYlink written 12 weeks ago by fishgolden420

It helps a lot, thank you!

ADD REPLYlink written 12 weeks ago by woojoy140
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1959 users visited in the last hour