Perl Reg-Ex Matching
11.1 years ago

How can I match exactly the following lines with a Perl regex?

ATOM  10360  H41 C   B 602
ATOM  10361  P   G   B 602
ATOM  10362  C5' G   B 602
ATOM  10363  O5' G   B 602

I tried something like:

/^ATOM\s\s\s[0-9]+\s\s\s...[A-Z]\s/

but this also matches:

ATOM   5248  HB2 SER A 326
ATOM   5249  HG  SER A 326
ATOM   5250  N   LEU A 327
perl • 2.6k views
11.1 years ago
Neilfws 49k

I assume from the question that:

  • your PDB file contains 2 or more chains
  • of which one or more is protein, one or more is nucleic acid
  • you want to match the nucleic acid, not the protein

You could try this:

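^ATOM\s+\d+\s+\S+\s+[ACGT]\s+

This assumes that: (1) the second column contains only digits; (2) the fourth column contains only A, C, G or T (upper-case). The third column is matched with \S+ rather than \w+ because atom names such as C5' and O5' contain an apostrophe, which \w does not match.

If you want to match on any single upper-case letter in column 4:

^ATOM\s+\d+\s+\S+\s+[A-Z]\s+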

You may also want to look at BioPerl methods for parsing PDB files. I'm having trouble locating a good one-stop resource for that, so you'll have to web search using those terms.
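A quick way to check a pattern like this is to run it against the sample lines from the question. The following is only a sketch: the variable names are made up for illustration, and the atom-name column is matched with \S+ because names such as C5' contain an apostrophe, which \w does not match.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sample lines copied from the question.
my @nucleic = (
    "ATOM  10360  H41 C   B 602",
    "ATOM  10361  P   G   B 602",
    "ATOM  10362  C5' G   B 602",
    "ATOM  10363  O5' G   B 602",
);
my @protein = (
    "ATOM   5248  HB2 SER A 326",
    "ATOM   5249  HG  SER A 326",
    "ATOM   5250  N   LEU A 327",
);

# Column 3 uses \S+ so that atom names with an apostrophe (C5', O5') match.
my $acgt_only  = qr/^ATOM\s+\d+\s+\S+\s+[ACGT]\s+/;
my $any_letter = qr/^ATOM\s+\d+\s+\S+\s+[A-Z]\s+/;

for my $re ($acgt_only, $any_letter) {
    # grep in scalar context returns the number of matching elements.
    my $hits     = grep { $_ =~ $re } @nucleic;
    my $rejected = grep { $_ !~ $re } @protein;
    printf "%d/%d nucleic lines matched, %d/%d protein lines rejected\n",
        $hits, scalar @nucleic, $rejected, scalar @protein;
}
```

The three-letter residue names are rejected because the single letter in column 4 must be followed by whitespace.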

11.1 years ago
Woa ★ 2.9k

Just keep in mind that PDB ATOM records are fixed-width, not space- or tab-delimited, so Perl's substr() function may be a better candidate than regex matching for parsing the file.
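As a rough illustration of the substr() approach, here is a minimal sketch. It assumes the standard PDB ATOM column layout (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26), which the sample lines in the question appear to follow. The function name parse_atom_line and the hash keys are made up for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref of trimmed fields, or nothing
# if the line is not an ATOM record or is too short to hold the fields.
sub parse_atom_line {
    my ($line) = @_;
    return unless $line =~ /^ATOM/ && length($line) >= 26;

    # substr() offsets are 0-based, PDB column numbers are 1-based.
    my %field = (
        serial   => substr($line, 6, 5),
        name     => substr($line, 12, 4),
        res_name => substr($line, 17, 3),
        chain    => substr($line, 21, 1),
        res_seq  => substr($line, 22, 4),
    );
    s/^\s+|\s+$//g for values %field;   # trim the padding in place
    return \%field;
}

my @lines = (
    "ATOM  10362  C5' G   B 602",
    "ATOM   5248  HB2 SER A 326",
);

for my $line (@lines) {
    my $atom = parse_atom_line($line) or next;
    # Assumption for this data: a one-letter residue name means a nucleotide.
    my $kind = length($atom->{res_name}) == 1 ? "nucleic acid" : "protein";
    print "$atom->{serial} $atom->{name} $atom->{res_name} ($kind)\n";
}
```

Selecting on the residue-name field this way does not depend on how many spaces separate the columns.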

11.1 years ago
SES 8.6k

For this data, you should be using unpack, not substr() and definitely not a regex if you are just trying to parse the file:

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dump qw(dd);

my @data;

while(my $line = <DATA>) {
    if ($line =~ /^ATOM/) {
        push @data, [unpack "A6A7A4A4A2A*", $line];
    }
}

dd @data;

__DATA__
ATOM  10360  H41 C   B 602
ATOM  10361  P   G   B 602
ATOM  10362  C5' G   B 602
ATOM  10363  O5' G   B 602

If you just want to match lines, Perl already reads the input one line at a time, so you don't need a regex; just sort your lines. If you want a specific column, sort @data in the above code. Executing this code:

perl biostar64928.pl

gives you the components of each column:

(
  ["ATOM", 10360, "H41", "C", "B", 602],
  ["ATOM", 10361, "P", "G", "B", 602],
  ["ATOM", 10362, "C5'", "G", "B", 602],
  ["ATOM", 10363, "O5'", "G", "B", 602],
)
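Once the fields are unpacked, picking out records by column is a plain grep. This is only a sketch: it reuses the same unpack template on a mix of lines from the question, and it assumes that a one-letter residue name (fourth field) marks a nucleotide, which holds for this data but is not a general rule.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Two nucleic acid and two protein lines copied from the question.
my @lines = (
    "ATOM  10361  P   G   B 602",
    "ATOM  10362  C5' G   B 602",
    "ATOM   5249  HG  SER A 326",
    "ATOM   5250  N   LEU A 327",
);

# Fields: record type, serial, atom name, residue name, chain, residue number.
my @records = map { [ unpack "A6A7A4A4A2A*", $_ ] } grep { /^ATOM/ } @lines;

# Assumption: a single-letter residue name means a nucleotide.
my @nucleotides = grep { length($_->[3]) == 1 } @records;

print join("\t", @$_), "\n" for @nucleotides;
```

Note that the A format strips trailing whitespace only, so a right-justified serial such as 5249 keeps its leading space in the second field.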

Note that real PDB files contain many more lines than those beginning with ATOM, and those lines have different formats. However, I think this will still work in that case; you just have to pull out the ATOM elements.


If that's the case, your regex would not work but my solution would :). Just answering the question based on what was provided.


Actually your solution will work; edited my comment. And yes, my regex will work since ATOM lines start with ATOM and are well-defined. Suggest you look at some real PDB data :)


My solution does not depend on what the lines start with; that is the point. It is a solution that will work with any fixed-width file.


Yes, I see that. It's a good solution, but then you still have to pull out the arrays where the first element is "ATOM". A regex does that for you straight away. It won't be a huge performance hit since PDB files are not very large.


You are correct. I added a line to address your point (although it's silly in this example :) ).

11.1 years ago
diltsjeri ▴ 470

/^ATOM\t[\d]+\t[.]\t[\w]+\t[\w]+\t\d\d\d/

Should work.


The field separator is not a tab; it is multiple spaces.

11.1 years ago

I think I got it.

/^ATOM\s+\d+\s+\S+\s+\b[A-Z]\b/

Thanks.
