Question

Trying to use regular expressions to edit a PDB text file.What am I doing wrong?

0

Entering edit mode

5.5 years ago

westin.kosater ▴ 80

I have a text file full of amino acids (CA.txt) as well as some other data. Here is a snippet of the text file

ATOM    109  CA ASER A  48      10.832  19.066  -2.324  0.50 61.96           C  
ATOM    121  CA AALA A  49      12.327  22.569  -2.163  0.50 60.22           C  
ATOM    131  CA AGLN A  50       8.976  24.342  -1.742  0.50 56.71           C  
ATOM    145  CA APRO A  51       7.689  25.565   1.689  0.50 51.89           C  
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C  
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C  
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C  
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C  
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C  
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C  
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C

It is only a 36 KB file. Essentially, my problem is that a few of the amino acids have the letter A in front of them where they are not supposed to be. Amino acid abbreviations are supposed to be 3 letters long. I have attempted to use regular expressions to remove the A at every instance of A in front of an amino acid abbreviation and write the new text to the file "CA-Finale.txt" . Here is my code so far

def Trimmer(txtFileName):
    i = open('CA-Finale.txt', 'w')
    j = open(txtFileName, 'r')
    for record in j:
        with open(txtFileName, 'r') as j:
            content= j.read()
            content_new = re.sub(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', r'\1\2', content, flags = re.M)

            i.write(content_new)
Trimmer('CA.txt')

When I run this, the text file that is generated is 16.7 MB in size, so something has clearly gone wrong. What could it be?

python biopython regex • 2.1k views

ADD COMMENT • link updated 5.5 years ago by Joe 21k • written 5.5 years ago by westin.kosater ▴ 80

0

Entering edit mode

At least part of your problem is that you open the file twice...and the first one never gets closed.

ADD REPLY • link 5.5 years ago by Joe 21k

0

Entering edit mode

Did you try removing/commenting out this line: j = open(txtFileName, 'r') westin.kosater

ADD REPLY • link 5.5 years ago by cpad0112 21k

2

Entering edit mode

5.5 years ago

Pierre Lindenbaum 161k

 awk -F '\t'  '/^ATOM/ {OFS="\t";if(length($4)!=3) $4=substr($4,2);print;next}{print;}' pdb.txt

ADD COMMENT • link 5.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

5.5 years ago

JC 13k

$ perl -ae 'if (length $F[3] > 3) { $new = $F[3]; $new =~ s/^A/ /; s/$F[3]/$new/; } print' < file.pdb
ATOM    109  CA  SER A  48      10.832  19.066  -2.324  0.50 61.96           C
ATOM    121  CA  ALA A  49      12.327  22.569  -2.163  0.50 60.22           C
ATOM    131  CA  GLN A  50       8.976  24.342  -1.742  0.50 56.71           C
ATOM    145  CA  PRO A  51       7.689  25.565   1.689  0.50 51.89           C
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C

ADD COMMENT • link 5.5 years ago by JC 13k

0

Entering edit mode

5.5 years ago

cpad0112 21k

 awk -v OFS="\t" '{if (length($4)==4){sub("^[A-Z]","", $4)}}1' test.txt

or

awk -v OFS="\t" '{if (length($4)==4){$4=substr($4,2,length($4))}}1' test.txt

ATOM    109 CA  SER A   48  10.832  19.066  -2.324  0.50    61.96   C
ATOM    121 CA  ALA A   49  12.327  22.569  -2.163  0.50    60.22   C
ATOM    131 CA  GLN A   50  8.976   24.342  -1.742  0.50    56.71   C
ATOM    145 CA  PRO A   51  7.689   25.565  1.689   0.50    51.89   C
ATOM    158 CA  GLN A   52  5.174   23.336  3.467   1.00    43.45   C   
ATOM    167 CA  HIS A   53  2.339   24.135  5.889   1.00    38.39   C   
ATOM    177 CA  PHE A   54  0.900   22.203  8.827   1.00    33.79   C   
ATOM    188 CA  TYR A   55  -1.217  22.065  11.975  1.00    34.89   C   
ATOM    200 CA  ALA A   56  0.334   20.465  15.090  1.00    31.84   C   
ATOM    205 CA  VAL A   57  0.000   20.066  18.885  1.00    30.46   C   
ATOM    212 CA  VAL A   58  2.738   21.762  20.915  1.00    27.28   C

ADD COMMENT • link 5.5 years ago by cpad0112 21k

score 3 · Accepted Answer · 2018-11-02

Since you were very close with your python script, I fixed it (you'd done the hard part IMO, which is the RegEx).

import sys
import re

with open(sys.argv[1], 'r') as input:
    for line in input:
        new = re.sub(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', r'\1 \2', line, flags = re.M) #Added a space between backreferences to account for the deleted char
        print(new.rstrip('\n')) # Remove an additional extraneous newline.

So, with this input:

ATOM    109  CA ASER A  48      10.832  19.066  -2.324  0.50 61.96           C
ATOM    121  CA AALA A  49      12.327  22.569  -2.163  0.50 60.22           C
ATOM    131  CA AGLN A  50       8.976  24.342  -1.742  0.50 56.71           C
ATOM    145  CA APRO A  51       7.689  25.565   1.689  0.50 51.89           C
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C

Invoking the script as python myscript.py inputfile.pdb, yields:

ATOM    109  CA  SER A  48      10.832  19.066  -2.324  0.50 61.96           C
ATOM    121  CA  ALA A  49      12.327  22.569  -2.163  0.50 60.22           C
ATOM    131  CA  GLN A  50       8.976  24.342  -1.742  0.50 56.71           C
ATOM    145  CA  PRO A  51       7.689  25.565   1.689  0.50 51.89           C
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C

This will print to screen (which I prefer), and a new file can be made by redirecting the output: python script.py infile.pdb > outfile.pdb. If you want the script to do that instead, make the following changes:

import sys
import re

with open(sys.argv[1], 'r') as input, open(sys.argv[2], 'w') as output:
    for line in input:
        new = re.sub(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', r'\1 \2', line, flags = re.M)
        print(new.rstrip('\n')) # Optionally delete this line
        output.write(new.rstrip('\n')