Question: Parse Many/Multiple .Pdb Files Using Perl Script
0
gravatar for hmarajkrishnan
4.8 years ago by
hmarajkrishnan0 wrote:

I am final year bioinformatics student, doing final year project on "Protein Database". Im using lots of .pdb files, rely alot. I saved all the .pdb files in a folder. Following that i want parse all the .pdb files simultaneously around (50 - 60 .pdb files) and output in a single output file. I dont want to input file name manualy in 'cmd'. Instead,i want the perl script that can read the stored .pdb files in the folder as input, 1 by 1..

(THIS IS THE PERL SCRIPT THAT HELPED ME ALOT. BUT I WANT TO PARSE AROUND 50-60 .pdb FILES WHICH IS STORED IN A FOLDER 1 BY 1 AND OUTPUT IN A SINGLE TEXT FILE. I DONT WANT ENTER THE STORED .pdb FILE NAMES MANULAY IN COMMAND PROMPT)

(THANKS TO MR.JORGE AMIGO LECHUGA)

#!/usr/bin/perl -w
#
# coordExtract.pl
#     # parser that allows editing a pdb
# file to extract the coordinates
#
# Jorge Amigo Lechuga
#

#######################
# READ THE INPUT FILE #
#######################

# input file query
print "\nEnter the input file: ";
$inputFile = <STDIN>;
chomp $inputFile;

unless (open(INPUTFILE, $inputFile)) {
    print "Cannot read from '$inputFile'.\nProgram closing.\n";
    <STDIN>;
    exit;
}

# load the file into an array
chomp(@dataArray = <INPUTFILE>);

# close the file
close(INPUTFILE);

###############
# LET'S WORK! #
###############

# parse the input file saving only backbone atoms coordinates
# format: [string "ATOM"] [number] [atom] [aa] whateva [3 decimal numbers] whateva with two dots in between
for ($line = 0; $line < scalar @dataArray; $line++) {
    if ($dataArray[$line] =~ m/ATOM\s+\d+\s+(\w+)\s+\w{3}\s+.+\s+(\S+\.\S+)\s+(\S+\.\S+)\s+(\S+\.\S+)\s+.+\..+\..+/ig) {
        if ($1 eq "N" || $1 eq "CA" || $1 eq "C") {
            $parsedData{$line} = $2."\t".$3."\t".$4;
        }
    }
}

# create the output file name
$outputFile = "coordinates_".$inputFile;

# open the output file
open (OUTFILE, ">$outputFile");
# print the data lines
foreach $line (sort {$a <=> $b} keys %parsedData) {
    print OUTFILE $parsedData{$line}."\n";
}

# close the output file
close (OUTFILE);

# end message
print "The coordinates of '$inputFile' were saved into '$outputFile'.\n";

# end the program
exit;
pdb protein script multiple • 4.5k views
ADD COMMENTlink modified 4.8 years ago by Dan D6.6k • written 4.8 years ago by hmarajkrishnan0

hi, the perl script is running but, the loop is not working. the script give output for only 1 .pdb file. :(

ADD REPLYlink written 4.8 years ago by hmarajkrishnan0
1

OK I'm taking a look right now. That's encouraging that it's processing one file correctly. Probably just a minor tweak.

ADD REPLYlink written 4.8 years ago by Dan D6.6k

stil the same. loop is not working.

unless (open(INPUTFILE, $pdbFile)) { print "Cannot read from '$pdbFile'.\nProgram closing.\n"; ; #exit; (i disable the exit command and its stil the same) }

ADD REPLYlink written 4.8 years ago by hmarajkrishnan0
2
gravatar for Dan D
4.8 years ago by
Dan D6.6k
Tennessee
Dan D6.6k wrote:

EDIT: I foolishly left the "exit" command inside the loop. My bad!

OK, try running this in the directory with your PDB files. Back them up before you run this!

#!/usr/bin/perl -w
#
# coordExtract.pl
#     # parser that allows editing a pdb
# file to extract the coordinates
#
# Jorge Amigo Lechuga
#

#######################
# READ THE INPUT FILE #
#######################

my @pdbFiles = glob("*.pdb");
foreach my $pdbFile (@pdbFiles){

    unless (open(INPUTFILE, $pdbFile)) {
        print "Cannot read from '$pdbFile'.\nProgram closing.\n";
        <STDIN>;
        exit;
    }

    # load the file into an array
    chomp(@dataArray = <INPUTFILE>);

    # close the file
    close(INPUTFILE);

    ###############
    # LET'S WORK! #
    ###############

    # parse the input file saving only backbone atoms coordinates
    # format: [string "ATOM"] [number] [atom] [aa] whateva [3 decimal numbers] whateva with two dots in between
    for ($line = 0; $line < scalar @dataArray; $line++) {
        if ($dataArray[$line] =~ m/ATOM\s+\d+\s+(\w+)\s+\w{3}\s+.+\s+(\S+\.\S+)\s+(\S+\.\S+)\s+(\S+\.\S+)\s+.+\..+\..+/ig) {
            if ($1 eq "N" || $1 eq "CA" || $1 eq "C") {
                $parsedData{$line} = $2."\t".$3."\t".$4;
            }
        }
    }

    # create the output file name
    $outputFile = "coordinates_".$pdbFile;

    # open the output file
    open (OUTFILE, ">$outputFile");
    # print the data lines
    foreach $line (sort {$a <=> $b} keys %parsedData) {
        print OUTFILE $parsedData{$line}."\n";
    }

    # close the output file
    close (OUTFILE);

    # end message
    print "The coordinates of '$pdbFile' were saved into '$outputFile'.\n";

}

# end the program
exit;
ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by Dan D6.6k

stil the same. unless (open(INPUTFILE, $pdbFile)) { print "Cannot read from '$pdbFile'.\nProgram closing.\n"; <stdin>; #exit; (i disable the exit command and its stil the same) }

ADD REPLYlink written 4.8 years ago by hmarajkrishnan0
1

It worked for me. See this screenshot:

https://www.dropbox.com/s/5csowphrqncui7n/pdb.JPG

Did you copy the entire script above? If so, put it in the directory with your PDB files. Navigate to the directory with the PDB files and type perl coordExtract.pl

ADD REPLYlink written 4.8 years ago by Dan D6.6k

OMG, ITS WORKS IN window o/s!!!!!! thanks alot DEEDEE!!!!! b4 tiz i was tryin in ubuntu and it did not work, bt works charm is WINDOWS!!!! THANKS AGAIN!!!!!!!! 1 more thing, can u give all the output in single file???

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by hmarajkrishnan0
1

In Windows, navigate to the directory with the pdb output. Type the following command: type coordinates_* > aggregateOutput.pdb

ADD REPLYlink written 4.8 years ago by Dan D6.6k

#!/usr/bin/perl -w #

coordExtract.pl

# parser that allows editing a pdb

file to extract the coordinates

#count for all amino acids existing in the protein $count_of_alanine=0; $count_of_arginine=0; $count_of_asparagine=0; $count_of_aspartic_acid=0; $count_of_cysteine=0; $count_of_glutamic_acid=0; $count_of_glutamine=0; $count_of_glycine=0; $count_of_histidine=0; $count_of_isoleucine=0; $count_of_leucine=0; $count_of_lysine=0; $count_of_methionine=0; $count_of_phenylalanine=0; $count_of_proline=0; $count_of_serine=0; $count_of_threonine=0; $count_of_tryptophan=0; $count_of_tyrosine=0; $count_of_valine=0; #count for groups of amino acids $count_of_charged=0; $count_of_polar=0; $count_of_aromatic=0; $count_of_hydrophobic=0;

input file query

my @pdbFiles = glob("*.pdb"); foreach my $pdbFile (@pdbFiles){

unless (open(INPUTFILE, $pdbFile)) { print "Cannot read from '$pdbFile'.\nProgram closing.\n"; <stdin>; exit; }

load the file into an array

chomp(@dataArray = <inputfile>);

close the file

close(INPUTFILE);

###############
# LET'S WORK! #
###############

parse the input file saving only backbone atoms coordinates

format: [string "ATOM"] [number] [atom] [aa] whateva [3 decimal numbers] whateva with two dots in between

for ($line = 0; $line < scalar @dataArray; $line++) { if ($dataArray[$line]=~/^HEADER\s+(.?)$/) { $header = $1;
} if ($dataArray[$line]=~/^TITLE\s+(.
?)$/) { $parsing{$line} = $1; } if ($dataArray[$line] =~ m/ATOM\s+(\d+)\s+(\w+)\s+(\w{3})\s+.+\s+(\S+.\S+)\s+(\S+.\S+)\s+(\S+.\S+)\s+.+..+..+/ig) { if ($2 eq "N" || $2 eq "CA" || $2 eq "C") { $parsedData{$line} = $1."\t\t".$3."\t\t".$4."\t\t".$5."\t\t".$6;

if($3 eq "ALA"){
$count_of_alanine++;
}
if($3 eq "ARG"){
$count_of_arginine++;
}
if($3 eq "ASN"){
$count_of_asparagine++;
}
if($3 eq "ASP"){
$count_of_aspartic_acid++;
}
if($3 eq "CYS"){
$count_of_cysteine++;
}
if($3 eq "GLU"){
$count_of_glutamic_acid++;
}
if($3 eq "GLN"){
$count_of_glutamine++;
}
if($3 eq "GLY"){
$count_of_glycine++;
}
if($3 eq "HIS"){
$count_of_histidine++;
}
if($3 eq "ILE"){
$count_of_isoleucine++;
}
if($3 eq "LEU"){
$count_of_leucine++;
}
if($2 eq "LYS"){
$count_of_lysine++;
}
if($3 eq "MET"){
$count_of_methionine++;
}
if($3 eq "PHE"){
$count_of_phenylalanine++;
}
if($3 eq "PRO"){
$count_of_proline++;
}
if($3 eq "SER"){
$count_of_serine++;
}
if($3 eq "THR"){
$count_of_threonine++;
}
if($3 eq "TRP"){
$count_of_tryptophan++;
}
if($3 eq "TYR"){
$count_of_tyrosine++;
}
if($3 eq "VAL"){
$count_of_valine++;
}
if($3 eq "ARG"||$3 eq "ASP"||$3 eq "GLU"||$3 eq "LYS"){
$count_of_charged++;
}
if($3 eq "ASN"||$3 eq "GLN"||$3 eq "GLY"||$3 eq "MET"||$3 eq "PRO"){
$count_of_polar++;
}
if($3 eq "PHE"||$3 eq "TRP"||$3 eq "TYR"||$3 eq "HIS"){
$count_of_aromatic++;
}
if($3 eq "ALA"||$3 eq "ILE"||$3 eq "LEU"||$3 eq "VAL"){
$count_of_hydrophobic++;
}
}

}

}

create the output file name

$outputFile = "coordinates_".$pdbFile;

open the output file

open (OUTFILE, ">$outputFile");

print the data lines

print OUTFILE $header, "\n";
foreach $line (sort {$a <=> $b} keys %parsing) {
print OUTFILE $parsing{$line}."\n";

} print OUTFILE $title, "\n"; print OUTFILE "-----------------------------------------------------------------------\n"; print OUTFILE "------------ALL BACKBONE AMINO ACIDS IN THE MEMBRANE PROTEIN-----------\n"; print OUTFILE "-----------------------------------------------------------------------\n"; print OUTFILE "\n";
print OUTFILE "Atom Number\tAmino acid\tX coordinate\tY Coordinate\tZ Coordinate\n"; foreach $line (sort {$a <=> $b} keys %parsedData) { print OUTFILE $parsedData{$line}."\n"; }

print OUTFILE "\n";
print OUTFILE "\n";
print OUTFILE "------------------------------------------------------------------------\n";
print OUTFILE "-----AMINO ACID NUMBERS AND PERCENTAGE FOR THE BACKBONE AMINO ACIDS-----\n";
print OUTFILE "------------------------------------------------------------------------\n";
print OUTFILE "\n";    
print OUTFILE "Amino acid\tTotal Number(N)\n";
print OUTFILE "--------------------------------------------------\n";
print OUTFILE "Alanine\t\t", $count_of_alanine++, "\t\t\n";
print OUTFILE "Arginine\t", $count_of_arginine++, "\t\t\n";
print OUTFILE "Asparagine\t", $count_of_asparagine++, "\t\t\n";
print OUTFILE "Aspartic Acid\t", $count_of_aspartic_acid++, "\t\t\n";
print OUTFILE "Cysteine\t", $count_of_cysteine++, "\t\t\n";
print OUTFILE "Glutamic Acid\t",$count_of_glutamic_acid++, "\t\t\n";
print OUTFILE "Glutamine\t",$count_of_glutamine++, "\t\t\n";
print OUTFILE "Glycine\t\t",$count_of_glycine++, "\t\t\n";
print OUTFILE "Histidine\t",$count_of_histidine++, "\t\t\n";
print OUTFILE "Isoleucine\t",$count_of_isoleucine++, "\t\t\n";
print OUTFILE "Leucine\t\t",$count_of_leucine++, "\t\t\n";
print OUTFILE "Lysine\t\t",$count_of_lysine++, "\t\t\n";
print OUTFILE "Methionine\t",$count_of_methionine++, "\t\t\n";
print OUTFILE "Phenylalanine\t",$count_of_phenylalanine++, "\t\t\n";
print OUTFILE "Proline\t\t",$count_of_proline++, "\t\t\n";
print OUTFILE "Serine\t\t",$count_of_serine++, "\t\t\n";
print OUTFILE "Threonine\t",$count_of_threonine++, "\t\t\n";
print OUTFILE "Tryptophan\t",$count_of_tryptophan++, "\t\t\n";
print OUTFILE "Tyrosine\t",$count_of_tyrosine++, "\t\t\n";
print OUTFILE "Valine\t\t",$count_of_valine++, "\t\t\n";
print OUTFILE "--------------------------------------------------\n";
print OUTFILE "Total\t\t", scalar(keys %parsedData), "\t\t\n";
print OUTFILE "\n";
print OUTFILE "\n";
print OUTFILE "CHARGED AMINO ACIDS\t\t\t",$count_of_charged++,"\n";
print OUTFILE "POLAR AMINO ACIDS\t\t\t",$count_of_polar++,"\n";
print OUTFILE "AROMATIC AMINO ACIDS\t\t\t",$count_of_aromatic++,"\n";
print OUTFILE "HYDROPHOBIC AMINO ACIDS\t\t\t",$count_of_hydrophobic++,"\n";
print OUTFILE "\n";
print OUTFILE "\n";


# close the output file
close (OUTFILE);

# end message
print "The coordinates of '$pdbFile' were saved into '$outputFile'.\n";

}

end the program

exit;

ADD REPLYlink written 4.8 years ago by hmarajkrishnan0

i use this perl script to parse .pdb files in a folder(around 1000 files). unfortunately the loop in the script causing problem.. the number of amino acids and other data in the .pdb files keep on increasing... the data stacking with another .pdb file i guess.. help me with the loop..

ADD REPLYlink written 4.8 years ago by hmarajkrishnan0

i tried, bt its not working.. im using this mac now.. i use this command to read the entire .pdb files 1 folders, "for fl in *.pdb;do perl filename.pl $fl; done".... its working prety well.. but compile all the outputs in only 1 single file???

ADD REPLYlink written 4.8 years ago by hmarajkrishnan0
1
gravatar for Dan D
4.8 years ago by
Dan D6.6k
Tennessee
Dan D6.6k wrote:

Step 1: Make a copy of the coordExtract.pl script and change this line:

print "\nEnter the input file: "; $inputFile = ; chomp $inputFile;

to

$inputFile = $ARGV[0];

Step 2: In your Linux bash shell, navigate to the directory containing the .PDB files and type the following:

for f in *.pdb
do
   /path/to/coordExtract.pl $f
done

Make copies of the original script and the PDB files before you test this solution!

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by Dan D6.6k

thank you for the commend, bt i using windows OS.. can you give the perl script for windows??

ADD REPLYlink written 4.8 years ago by hmarajkrishnan0
1

OK, so just copy the files to a linux server and do the above. ;) Seriously though, I'll stop being lazy and write you something that's O/S agnostic. One minute...

ADD REPLYlink written 4.8 years ago by Dan D6.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1881 users visited in the last hour