Extract FASTA sequences from a PDB file using a Perl script
6.4 years ago
skjobs1234 ▴ 40

I would like to extract the one-letter amino acid sequence from a PDB coordinate file.


It would help if you pasted an example PDB coordinate file. I'm not a Perl programmer, so I won't be providing an answer myself, but an example will help the Perl programmers here.

ATOM      1  N   MET A 183      14.367  14.409 -14.168  1.00 50.39           N  
ATOM      2  CA  MET A 183      14.811  14.674 -15.562  1.00 48.32           C  
ATOM      3  C   MET A 183      13.602  14.830 -16.511  1.00 43.21           C  
ATOM      4  O   MET A 183      13.525  14.177 -17.547  1.00 43.55           O  
ATOM      5  CB  MET A 183      15.718  13.527 -16.033  1.00 48.72           C  
ATOM      6  N   LEU A 184      12.672  15.708 -16.152  1.00 42.00           N  
ATOM      7  CA  LEU A 184      11.447  15.964 -16.970  1.00 39.32           C  
ATOM      8  C   LEU A 184      11.678  16.660 -18.321  1.00 41.97           C  
ATOM      9  O   LEU A 184      10.824  16.619 -19.224  1.00 39.74           O  
ATOM     10  CB  LEU A 184      10.478  16.835 -16.169  1.00 38.33           C  
ATOM     11  CG  LEU A 184       9.839  16.253 -14.920  1.00 36.74           C  
ATOM     12  CD1 LEU A 184       8.881  17.262 -14.332  1.00 38.09           C  
ATOM     13  CD2 LEU A 184       9.103  14.968 -15.262  1.00 35.13           C  
ATOM     14  N   LYS A 185      12.791  17.375 -18.435  1.00 42.50           N  
ATOM     15  CA  LYS A 185      13.161  18.016 -19.691  1.00 41.45           C  
ATOM     16  C   LYS A 185      13.722  16.983 -20.669  1.00 41.88           C  
ATOM     17  O   LYS A 185      13.747  17.251 -21.873  1.00 47.12           O  
ATOM     18  CB  LYS A 185      14.184  19.158 -19.455  1.00 40.64           C  
ATOM     19  N   LYS A 186      14.182  15.814 -20.201  1.00 37.88           N  
ATOM     20  CA  LYS A 186      14.726  14.837 -21.141  1.00 36.11           C  
ATOM     21  C   LYS A 186      13.577  14.319 -22.001  1.00 35.18           C  
ATOM     22  O   LYS A 186      12.478  13.996 -21.479  1.00 34.26           O  
ATOM     23  CB  LYS A 186      15.460  13.672 -20.470  1.00 36.66           C  
ATOM     24  N   LYS A 187      13.842  14.254 -23.298  1.00 33.43           N  
ATOM     25  CA  LYS A 187      12.881  13.836 -24.324  1.00 32.15           C  
ATOM     26  C   LYS A 187      12.824  12.313 -24.346  1.00 30.19           C  
ATOM     27  O   LYS A 187      13.387  11.676 -25.230  1.00 31.65           O  
ATOM     28  CB  LYS A 187      13.337  14.358 -25.694  1.00 34.18           C

And what is the expected output? I guess for this example it is MLKKK, is that right? Or do you want MMMMMLLLLLLL...?


Why Perl? Usually, when we address a bioinformatics problem, we look for the best tool for the job rather than limiting ourselves to one specific tool.


I found one more way.

See this site below:

http://www.ebi.ac.uk/pdbe-srv/PDBeXplore/sequence/

Enter a structure ID and/or the author's surname:

http://www.ebi.ac.uk/pdbe/entry/pdb/1aa0/

Fetch the sequence from the bottom-left corner of the page:

VSGLNNAVQNLQVEIGNNSAGIKGQVVALNTLVNGTNPNGSTVEERGLTNSIKANETNIASVTQEVNTAKGNISSLQGDVQALQEAGYIPEAPRDGQAYVRKDGEWVLLSTFL

OR

Quick links (upper-right corner)

• 1aa0 overview

http://www.ebi.ac.uk/pdbe/entry/pdb/1aa0/protein/1

Macromolecules:

pdb|1aa0|A VSGLNNAVQNLQVEIGNNSAGIKGQVVALNTLVNGTNPNGSTVEERGLTNSIKANETNIASVTQEVNTAKGNISSLQGDVQALQEAGYIPEAPRDGQAYVRKDGEWVLLSTFL

There are several ways to find a sequence.

6.4 years ago
h.mon 35k

This can get you started; you can then modify it to suit your needs. Save it as get_pdb_aa.pl and run it with perl get_pdb_aa.pl < file.pdb > out.txt. It prints one letter per ATOM line rather than one per residue, so for the example you provided it will output MMMMMLLLLLLLLKKKKKKKKKKKKKKK.

#!/usr/bin/env perl

use warnings;
use strict;

my %aa_table = (
ala => 'A',arg => 'R',asn => 'N',asp => 'D',
asx => 'B',cys => 'C',glu => 'E',gln => 'Q',
glx => 'Z',gly => 'G',his => 'H',ile => 'I',
leu => 'L',lys => 'K',met => 'M',phe => 'F',
pro => 'P',ser => 'S',thr => 'T',trp => 'W',
tyr => 'Y',val => 'V',
);

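# Read the PDB file line by line from STDIN; print one letter per ATOM record.
while ( my $line = <STDIN> ) {
    next unless $line =~ /^ATOM/;    # skip HEADER, REMARK, HETATM, TER, etc.
    # Fixed-width fields: columns 1-17, residue name (columns 18-20), the rest.
    my ($v1, $aa, $v3) = unpack 'A17A3A60', $line;
    my $letter = $aa_table{ lc($aa) };
    print $letter if defined $letter;    # ignore residues missing from the table
}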
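If what you want is one letter per residue (MLKKK for the example above) rather than one per atom, a possible variation is sketched below. It is only a sketch under some assumptions: the input uses the classic fixed-width PDB layout (residue name in columns 18-20, chain ID in column 22, residue number plus insertion code in columns 23-27), only the 20 standard residues are mapped, and anything else is written as X. The helper name pdb_to_sequences is made up for this illustration. Note also that ATOM records only cover residues that have coordinates, so the result can be shorter than the sequence listed in the SEQRES records.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Three-letter to one-letter codes for the 20 standard amino acids.
my %aa_table = (
    ALA => 'A', ARG => 'R', ASN => 'N', ASP => 'D', CYS => 'C',
    GLU => 'E', GLN => 'Q', GLY => 'G', HIS => 'H', ILE => 'I',
    LEU => 'L', LYS => 'K', MET => 'M', PHE => 'F', PRO => 'P',
    SER => 'S', THR => 'T', TRP => 'W', TYR => 'Y', VAL => 'V',
);

# Hypothetical helper: takes PDB lines, returns a hash ref of
# chain ID => one-letter sequence, with one letter per residue.
sub pdb_to_sequences {
    my @lines = @_;
    my (%seq, %last_id);
    for my $line (@lines) {
        next unless $line =~ /^ATOM/;       # coordinate records only
        next if length($line) < 27;         # too short to hold the fields
        my $res   = substr($line, 17, 3);   # columns 18-20: residue name
        my $chain = substr($line, 21, 1);   # column 22: chain ID
        my $id    = substr($line, 22, 5);   # columns 23-27: number + insertion code
        # Skip further atoms of the residue we have already recorded.
        next if defined $last_id{$chain} && $last_id{$chain} eq $id;
        $last_id{$chain} = $id;
        $seq{$chain} .= $aa_table{ uc $res } // 'X';   # X = not in the table
    }
    return \%seq;
}

# A few ATOM lines taken from the example in this thread.
my @sample = split /\n/, <<'END_PDB';
ATOM      1  N   MET A 183      14.367  14.409 -14.168  1.00 50.39           N
ATOM      2  CA  MET A 183      14.811  14.674 -15.562  1.00 48.32           C
ATOM      6  N   LEU A 184      12.672  15.708 -16.152  1.00 42.00           N
ATOM      7  CA  LEU A 184      11.447  15.964 -16.970  1.00 39.32           C
ATOM     14  N   LYS A 185      12.791  17.375 -18.435  1.00 42.50           N
ATOM     19  N   LYS A 186      14.182  15.814 -20.201  1.00 37.88           N
ATOM     24  N   LYS A 187      13.842  14.254 -23.298  1.00 33.43           N
END_PDB

my $seqs = pdb_to_sequences(@sample);
for my $chain (sort keys %$seqs) {
    print ">chain_$chain\n$seqs->{$chain}\n";   # prints ">chain_A" then "MLKKK"
}
```

To run it on a real file, replace @sample with the lines read from STDIN, for example pdb_to_sequences(<STDIN>). Tracking the last residue ID per chain keeps multi-chain files from being merged into one sequence.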

Can you tell me more about how unpack works in this case? And how would it change if the lines of the PDB file look like this: ATOM 1 N N . MET A 1 1 ? 36.644 -24.949 8.853 1.00 29.12 ? 1 MET A N 1 ????????
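A rough sketch of both cases, in case it helps. In the template 'A17A3A60', each A takes a fixed number of characters and strips trailing spaces: A17 takes the first 17 characters, A3 the next 3 (the residue name in the classic PDB layout) and A60 up to 60 more. A line like the one quoted above looks like the whitespace-separated mmCIF style, where column positions are not fixed, so splitting on whitespace is a better fit than unpack. The field index used below is an assumption based on that single line; in a real mmCIF file the column order is declared in the _atom_site loop header and should be read from there. The trailing characters of the quoted line are left out of the sample.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Classic fixed-width PDB line from the example in this thread.
my $pdb_line = 'ATOM      1  N   MET A 183      14.367  14.409 -14.168  1.00 50.39           N';

# A17 = first 17 characters, A3 = next 3, A60 = up to 60 more.
# The A format strips trailing spaces from each field it returns.
my ($head, $res, $rest) = unpack 'A17A3A60', $pdb_line;
print "fixed-width residue: $res\n";          # prints "fixed-width residue: MET"

# Whitespace-separated line in the style quoted in the question.
my $cif_line = 'ATOM 1 N N . MET A 1 1 ? 36.644 -24.949 8.853 1.00 29.12 ? 1 MET A N 1';

# split ' ' splits on runs of whitespace and ignores leading whitespace.
my @fields = split ' ', $cif_line;

# Assumption: the residue name is the 6th field (index 5), as in this line.
my $cif_res = $fields[5];
print "whitespace-split residue: $cif_res\n"; # prints "whitespace-split residue: MET"
```

Either value can then be looked up in the same three-letter to one-letter table used in the answer.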
