Question: How To Extract Just The Coordinate Values From A Pdb File, Using Perl Only?
2
gravatar for Sakshi
7.7 years ago by
Sakshi20
Sakshi20 wrote:

I have a pdb id of a protein, need to get the 3d co-ordinate values of its atoms only. Can anyone help me in writing the code to grip out just the co-ordinate values.

ADD COMMENTlink modified 19 months ago by Biostar ♦♦ 20 • written 7.7 years ago by Sakshi20
10
gravatar for Neilfws
7.7 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

Rather than write the code for you, I thought it would be useful to explore in general terms how we go about solving this type of problem.

What you want to do is parse a file: a very common operation in bioinformatics. So common in fact, that you will often find that someone has done the work for you. I'll come back to that briefly at the end of the answer.

Here are the steps involved in developing a parser.

First, understand the structure of the file - in your case, a PDB file. This is achieved by reading documentation and by examining the file yourself in e.g. a text editor. Knowing that PDB files come from the Protein Data Bank, a web search should lead you to the appropriate documentation.

Second, identify the sub-parts of the file that you want to extract. Try to determine what makes them distinct from other parts of the file. When you look at a PDB file, you'll see that the parts containing coordinates look like this:

ATOM    834  CD2 TYR A 108       2.688  48.014   2.423  1.00 13.01           C
HETATM  968 CU   CU1 A 201      16.622  34.752   4.635  0.90 18.04          CU

Try to explain in words what you see. You might say something like "line starts with ATOM or HETATM, followed by several space-separated fields, of which columns 7, 8 and 9 are X, Y and Z coordinates."

Third, without writing any code, describe how you would process the file to extract only those parts that interest you. Be precise and detailed. For example:

  • open the file
  • read one line at a time
  • does the line begin with ATOM or HETATM?
  • if so, split the line on space
  • print out only the fields with X, Y and Z

Finally, you are ready to write your code. What you wrote in step 3 is a guide to the methods that you need to learn (if you don't know them already) in your chosen programming language. So you need to know how to open a file for reading, scan one line at a time, match patterns to a line, split lines into arrays and print out the correct array elements.

And that is how you write a parser.

However - there's often no need to reinvent the wheel. There is a format called "XYZ" which is essentially just the coordinates and a web search for "convert PDB to XYZ" is sure to throw up a few tools. And don't be tied to one language if a good solution is available in another - such as this Python PDB->XYZ converter.

ADD COMMENTlink written 7.7 years ago by Neilfws48k
1

This is one of those answers that makes you wish you could up-vote more than once.

ADD REPLYlink written 7.7 years ago by Lars Juhl Jensen11k

Thank you Sir. You explained in such a nice way. Thank you so much.

ADD REPLYlink written 7.7 years ago by Sakshi20

Wicked! I really prefer such conceptual answers rather than the usual code copy & pasting. The learning effect is much greater with this kind of response, even though it might not be the quickest route to take.

ADD REPLYlink written 7.7 years ago by Joachim2.8k

While the pseudocode will work for most cases the split on space bit won't give the correct results for all PDB entries. This is due to the fields being position based rather than delimited. The appropriate positions to find each field in the line can be obtained from the PDB format documentation.

ADD REPLYlink written 6.8 years ago by Hamish3.0k
3
gravatar for Jorge Amigo
7.7 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

I honestly didn't revise this for a very long time, but I wrote this code back in 2004 that seems to do exactly what you need. feel free to use it, and let me know if it helps you in any way.

#!/usr/bin/perl -w

#
# coordExtract.pl
#
# parser that allows editing a pdb
# file to extract the coordinates
#
# Jorge Amigo Lechuga
#

#######################
# READ THE INPUT FILE #
#######################

# input file query
print "\nEnter the input file: ";
$inputFile = <STDIN>;
chomp $inputFile;

unless (open(INPUTFILE, $inputFile)) {
    print "Cannot read from '$inputFile'.\nProgram closing.\n";
    <STDIN>;
    exit;
}

# load the file into an array
chomp(@dataArray = <INPUTFILE>);

# close the file
close(INPUTFILE);

###############
# LET'S WORK! #
###############

# parse the input file saving only backbone atoms coordinates
# format: [string "ATOM"] [number] [atom] [aa] whateva [3 decimal numbers] whateva with two dots in between
for ($line = 0; $line < scalar @dataArray; $line++) {
    if ($dataArray[$line] =~ m/ATOM\s+\d+\s+(\w+)\s+\w{3}\s+.+\s+(\S+\.\S+)\s+(\S+\.\S+)\s+(\S+\.\S+)\s+.+\..+\..+/ig) {
        if ($1 eq "N" || $1 eq "CA" || $1 eq "C") {
            $parsedData{$line} = $2."\t".$3."\t".$4;
        }
    }
}

# create the output file name
$outputFile = "coordinates_".$inputFile;

# open the output file
open (OUTFILE, ">$outputFile");

# print the data lines
foreach $line (sort {$a <=> $b} keys %parsedData) {
    print OUTFILE $parsedData{$line}."\n";
}

# close the output file
close (OUTFILE);

# end message
print "The coordinates of '$inputFile' were saved into '$outputFile'.\n";

# end the program
exit;
ADD COMMENTlink written 7.7 years ago by Jorge Amigo11k
1

Why save it all in a hash first? You might as well have the "print OUTFILE" statement right where you parse the data :-)

ADD REPLYlink written 7.7 years ago by Lars Juhl Jensen11k

hahaha... this were days when data was light, and the state of the art was to load everything onto RAM to increase speed. sure dealing with the input lines as they are being read, without storing them into memory, would be more elegant. as I said, I just dumped my original code created when I only had like 1 year experience with Perl ;)

ADD REPLYlink written 7.7 years ago by Jorge Amigo11k

Thanks Jorge. Thanks so much. yes, it worked for me.

ADD REPLYlink written 7.7 years ago by Sakshi20

this code prints the coordinates in tab delimited form which is not useful when we have to open the structure in visualizing tools..

ADD REPLYlink written 7.6 years ago by Kanika10

true, but as I stated this was a piece of code I had from long ago, so I just wanted to share it here in case it helped in any way. if you don't like the tabulated format, you may modify the printing hash $parsedData{$line} = $2."t".$3."t".$4; as needed.

ADD REPLYlink written 7.6 years ago by Jorge Amigo11k

@Jorge this is a great illustration of why reading the format documentation is so important. The coordinates in the PDB format are in fixed width fields which do not have a separator between them, thus positional parsing needs to be used for the coordinates.

ADD REPLYlink written 6.8 years ago by Hamish3.0k
2
gravatar for Woa
7.7 years ago by
Woa2.7k
United States
Woa2.7k wrote:

Using Bioperl with some code lifted from this post

use strict;
use warnings;
use Bio::Structure::IO;
my $filename="2AAI.pdb";
#use Data::Dumper;
my $stream = Bio::Structure::IO->new(-file => $filename,
                                    -format => 'PDB');

while (my $struc = $stream->next_structure) {   
    for my $chain ($struc->get_chains) {
        my $chainid = $chain->id;
        # one-letter chaincode if present, 'default' otherwise
        for my $res ($struc->get_residues($chain)) {
            my $resid = $res->id;
            # format is 3-lettercode - dash - residue number, e.g. PHE-20
            my @atoms = $struc->get_atoms($res); # Atom objects, given a Residue
            map{my @xyz = $_->xyz;print join("\t",@xyz),"\n";}@atoms;  
        }
    }
}
ADD COMMENTlink written 7.7 years ago by Woa2.7k
1
gravatar for Jordeu
7.7 years ago by
Jordeu20
Barcelona
Jordeu20 wrote:

Check this other question:

http://biostar.stackexchange.com/questions/1638/how-to-extract-just-the-coordinate-values-from-a-pdb-file-converted-to-a-text-fil/6773#6773

ADD COMMENTlink written 7.7 years ago by Jordeu20

although the answers on that thread should be read since a couple of interesting ideas are mentioned, I understood that Sakshi was interested on a "perl only" solution and not java based.

ADD REPLYlink written 7.7 years ago by Jorge Amigo11k

yeah.... Jordeu... thats in Java. I need the code in perl only. thanks anyways.

ADD REPLYlink written 7.7 years ago by Sakshi20
1
gravatar for Woa
7.7 years ago by
Woa2.7k
United States
Woa2.7k wrote:

Using Perlmol :

use strict;
use warnings;
use Chemistry::Mol;
use Chemistry::File::PDB;
my $pdb = Chemistry::Mol->read("2AAI.pdb");

map{my $serial_no=$_;my $name=$_->name;my @xyz = $_->coords->array;
    print $serial_no,"\t",$_->name,"\t",join("\t",@xyz),"\n";
}($pdb->atoms);

A very good introduction on perl based PDB parsing and a OpenGl based molecule viewer can be found from THIS issue of Perl Journal, authored by Simon Cozens

ADD COMMENTlink modified 6.8 years ago • written 7.7 years ago by Woa2.7k

For more details and other attributes see this tutorial http://perlmol.org/pod/Chemistry/Tutorial.html

There is a Bioperl module too, but I didn't try it so far:

http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/Structure/IO/pdb.pm

ADD REPLYlink written 7.7 years ago by Woa2.7k

A very good introduction on perl based PDB parsing and a OpenGl based molecule viewer can be found from the following issue of Perl Journal, authored by Simon Cozens http://japhy.perlmonk.org/personal/0408tpj.pdf

ADD REPLYlink written 7.6 years ago by Woa2.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1177 users visited in the last hour