Question: How to extract and calculate all protein atom to ligand atom distances from pdb files in PERL
0
gravatar for srdjanmasirevic2
3 months ago by
srdjanmasirevic210 wrote:

Hello, I am trying to find the way how to calculate all atom distances from pdb file ? CAn anybody help me. I have a txt files that look like this :

ATOM       1279 C    ALA    81      -1.925 -11.270   1.404
ATOM       1280 O    ALA    81      -0.279   9.355  15.557
ATOM       1281 OXT  ALA    81      -2.188  10.341  15.346  
HETATM   1282 N    THR    82      29.632   5.205   5.525
HETATM   1283 H1   THR    82      30.175   4.389   5.768
HETATM   1284 H2   THR    82      28.816   4.910   5.008

I wrote a script for calculating distances but still I am having some problem, it doesnt print any distances. I was trying to understand why but I am not sure sinse doesnt give me any error.

#!/usr/local/bin/perl 

use strict;
use warnings;

open(IN, $ARGV[0]) or die "$!"; 
my (@refer, @points);
my $part = 0;
my $dist;
while (my $line = <IN>) { 
    chomp($line);
    if ($line =~ /^HETATM/) {
        $part++;
        next;
    }
    my @array = (substr($line, 30, 8),substr($line,38,8),substr($line,46,8));
#    print "@array\n";
    if ($part == 0) {
        push @refer, [ @array ]; 
    } elsif ($part ==1){
        push @points, [ @array ]; 
    }
}

    foreach my $ref(@refer) {
    my ($x1, $y1, $z1) = @{$ref};
    foreach my $atom(@points) {
        my ($x, $y, $z) = @{$atom};
        my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
    print $dist;

    }

}
perl • 376 views
ADD COMMENTlink modified 3 months ago by yamule180 • written 3 months ago by srdjanmasirevic210

This looks like homework, what have you tried?

ADD REPLYlink modified 3 months ago • written 3 months ago by Michael Dondrup43k

It's some kind of personal homework, not for school but form me to make some dataset. So what I have done so far is wrote a script in perl that from the whole pdb dataset I kept only those pdb files that have ligand in its binding pocket (I included some cut off for number of heavy atoms to exclude ions and solvents, so ended up with pdb files that have ligands). I also extracted from each file protein and ligand coordinates into another txt file. So currently I am having a files that look like this: for example file. 2KRJ.txt:

           Ax     Ay    Az
ATOM(1)   -1.23  4.55  -1.78 
ATOM(2)   -1.23  4.55  -1.78 
ATOM(3)   -1.23  4.55  -1.78 
ATOM(4)   -1.23  4.55  -1.78 
...(and so on)
           Bx     By    Bz
HETATM(1)  -7.5    5.7  -1.4
HETATM(2)  -7.5    5.7  -1.4
HETATM(3)  -7.5    5.7  -1.4
HETATM(4)  -7.5    5.7  -1.4
...(and so on)

The values for these coordinates are not the same of course. So I also found the formula fo calculating distances, but I am not sure how to calculate it: I want each atom (x,y,z coordinates) from protein (that starts with ^ATOM) to be calculated with each atom from ligand (that starts with ^HETATM). If the number of each calculatation is more then 5 (I should assigned some cut off) then move the file or something like that.

This is the formula: distance = sqrt(($Ax - $Bx)**2 + ($Ay - $By)**2 + ($Az - $Bz)**2)

I am not that experienced in programming so trying to figure out how to approach to this problem. Thanks

ADD REPLYlink modified 3 months ago • written 3 months ago by srdjanmasirevic210

So basically I am trying to calculate ATOM(1) with all HETATM(1,2,3,4) and the same for each atom of the protein. if I end up with result that is <= to five then I should keep this file. Overall from all files I am trying to keep only those in which the distance between ligand and some protein atoms is less or equal to 5 Armstrongs.

ADD REPLYlink modified 3 months ago • written 3 months ago by srdjanmasirevic210

Ok, that is totally feasible in Perl with two nested loops or maps, but why don't you use R on the file you made? read the file using atoms <- read.delim(my.file, sep="\t", header=T, row.names=1) ; dist(atoms); that what be about 1/10 of the code required in Perl.

ADD REPLYlink written 3 months ago by Michael Dondrup43k
1
gravatar for Michael Dondrup
3 months ago by
Bergen, Norway
Michael Dondrup43k wrote:

see Algorithm::DistanceMatrix package in Perl or check out the R-solution:

atoms <- read.delim(my.file, sep="\t", header=T, row.names=1) ; dist(atoms);
ADD COMMENTlink modified 3 months ago • written 3 months ago by Michael Dondrup43k
0
gravatar for srdjanmasirevic2
3 months ago by
srdjanmasirevic210 wrote:

Im using perl because whole project I started to do in perl and want to finish it in the same manner. I have some basics of Python and Perl but I have never used R; overall to be honest my programming sucks :P Perl seems to be fine for data proccessing but it is also a bit confusing in some aspects. I will take a look at this package, I have never used to install any of the additional packages for perl. This should provide me with some additional commands in perl?

Overall I know that this task is pretty straight forward but I am having a problem because I'm actually not sure how to assign a value for each coordinate and to give command to calculate distance between protein and ligand atoms. Maybe I can make use subruotines but Im not completely sure about it :P

ADD COMMENTlink modified 3 months ago • written 3 months ago by srdjanmasirevic210

I have made some progress, but I cant figure it out why script doesnt work. I have wrote two loops and pushed values of coordinates into seperate arrays and ttried to print distances but it doesnt work. Can You please take a look cause I am trying to change it numerous times. This is the script :

#!/usr/local/bin/perl 

use strict;
use warnings;

open(IN, $ARGV[0]) or die "$!"; 
my (@refer, @points);
my $part = 0;
my $dist;
while (my $line = <IN>) { 
    chomp($line);
    if ($line =~ /^HETATM/) {
        $part++;
        next;
    }
    my @array = (substr($line, 30, 8),substr($line,38,8),substr($line,46,8));
    ### MD: use split to split by delimiter:
    my @coords = split $line;


   ### MD: not sure what this is for. Do you need to distinguish between 
   ### two types of atoms? I thought you wanted just 
   ### the distances between all atoms?
   ### why not generate a simple nested list of all atom coords first?

#    print "@array\n";
    if ($part == 0) {
        push @refer, [ @array ]; 
    } elsif ($part ==1){
        push @points, [ @array ]; 
    }
}


   ### for calculation of a symmetric distance a nested for loop is more efficient 
   ### than a simple foreach, because you can calculate the the lower triangle 


  ### something like this should do the trick, not tested, and you need to define a sub dist
  my $distMat = [];
   for (my $i=0; $i < $M; $i++) {
        $distMat->[$i] = [];
        for (my $j=$i; $j < $N; $j++) {
            $distMat->[$i][$j] = dist($coords[$i], $coords[$j]);
        }
   }



    foreach my $ref(@refer) {
    my ($x1, $y1, $z1) = @{$ref};
    foreach my $atom(@points) {
        my ($x, $y, $z) = @{$atom};
        my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
    print $dist;

    }

}
ADD REPLYlink modified 3 months ago by Michael Dondrup43k • written 3 months ago by srdjanmasirevic210

I have put some edits into your example code, please have a look.

ADD REPLYlink written 3 months ago by Michael Dondrup43k

I wanted to check the distances between all protein atoms (ATOM) and ligand atoms (HETATM). Something like ATOM1 X HETATM1; ATOM2 X HETATM1 X ATOM3 X HETATM1 and so on.. I tried to put coordinated from the protein atoms ATOM into the one array and same for ligand atoms and then to calculate the distance between each of them. I am also trying to figure out how to assign that if the result of each calculation is less more then 5 Armstrongs then delete these lines that were involved into these calculation. I want to add that because at the end want to know which atoms from ligand and proteins are closest to each other, but still have no idea how to implement it into this script

ADD REPLYlink written 3 months ago by srdjanmasirevic210
0
gravatar for yamule
3 months ago by
yamule180
Japan
yamule180 wrote:

I think the main problems of the script are

if ($line =~ /^HETATM/) {
    $part++;
    next;
}

and

 elsif ($part ==1){
        push @points, [ @array ]; 
    }

If HETATM label is found, it will go to the first line of the "while" loop (because of "next") and no element will be pushed in "@points".

In addition, your sample lines are not written in a valid PDB format. Please check

https://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM

Thus the simple pdb entry and code will be as follows

ATOM   1279 C    ALA    81      -1.925 -11.270   1.404
ATOM   1280 O    ALA    81      -0.279   9.355  15.557
ATOM   1281 OXT  ALA    81      -2.188  10.341  15.346
HETATM 1282 N    THR    82      29.632   5.205   5.525
HETATM 1283 H1   THR    82      30.175   4.389   5.768
HETATM 1284 H2   THR    82      28.816   4.910   5.008

====

#!/usr/local/bin/perl 

use strict;
use warnings;
open(IN, $ARGV[0]) or die "$!"; 
my (@refer, @points);
my $dist;
while (my $line = <IN>) { 
    chomp($line);
    my $hetflag = 0;
    if ($line =~ /^HETATM/) {
        $hetflag = 1;
    }elsif($line !~ /^ATOM/){
        next;
    }

    my @array = (substr($line, 30, 8),substr($line,38,8),substr($line,46,8));
    if ($hetflag == 0) {
        push @refer, [ @array ]; 
    } elsif ($hetflag ==1){
        push @points, [ @array ]; 
    }

}



foreach my $ref(@refer) {
    my ($x1, $y1, $z1) = @{$ref};
    foreach my $atom(@points) {
        my ($x, $y, $z) = @{$atom};
        my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
         print $dist."\n";
    }
}

By the way, finding ligand coordinates from PDB format file is a bit difficult. Because HETATM label is used for unusual amino acid (modified residues) like SELENOMETHIONINE

http://www.rcsb.org/pdb/explore/explore.do?structureId=2a0u

Usually, coordinates of ligands come after "TER" line(, I think).

If you use PDBx/mmCIF format, it is easy to distinguish ligands but the format is so difficult to parse.

BioPerl may handle the format. But I don't recommend Perl because it's an old language.

ADD COMMENTlink modified 3 months ago • written 3 months ago by yamule180

Actually from my script I manage to change two lines:

  1. Just deleted next; line after $part ++;

  2. Delete ==1 in line } elsif ($part ==1){

its is calculating fine, but I am just not sure how to assign to this script that if the distance is more then 5 then to delete those lines, and if it is not then to print those lines in new file or to owerwright the existing files. So that I can end up with the files that would have only printed atoms lines in which distance is showed to be less then 5.

It is not a problem to say that if $dist <= 5 { print "Distance is less then 5\n"}; but I am not sure how to modify these files and to print these atom lines only. Best

ADD REPLYlink modified 3 months ago • written 3 months ago by srdjanmasirevic210
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1101 users visited in the last hour