Trying to use regular expressions to edit a PDB text file.What am I doing wrong?
4
0
Entering edit mode
3.1 years ago

I have a text file full of amino acids (CA.txt) as well as some other data. Here is a snippet of the text file

ATOM    109  CA ASER A  48      10.832  19.066  -2.324  0.50 61.96           C  
ATOM    121  CA AALA A  49      12.327  22.569  -2.163  0.50 60.22           C  
ATOM    131  CA AGLN A  50       8.976  24.342  -1.742  0.50 56.71           C  
ATOM    145  CA APRO A  51       7.689  25.565   1.689  0.50 51.89           C  
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C  
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C  
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C  
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C  
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C  
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C  
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C

It is only a 36 KB file. Essentially, my problem is that a few of the amino acids have the letter A in front of them where they are not supposed to be. Amino acid abbreviations are supposed to be 3 letters long. I have attempted to use regular expressions to remove the A at every instance of A in front of an amino acid abbreviation and write the new text to the file "CA-Finale.txt" . Here is my code so far

def Trimmer(txtFileName):
    i = open('CA-Finale.txt', 'w')
    j = open(txtFileName, 'r')
    for record in j:
        with open(txtFileName, 'r') as j:
            content= j.read()
            content_new = re.sub(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', r'\1\2', content, flags = re.M)

            i.write(content_new)
Trimmer('CA.txt')

When I run this, the text file that is generated is 16.7 MB in size, so something has clearly gone wrong. What could it be?

python biopython regex • 1.2k views
ADD COMMENT
0
Entering edit mode

At least part of your problem is that you open the file twice...and the first one never gets closed.

ADD REPLY
0
Entering edit mode

Did you try removing/commenting out this line: j = open(txtFileName, 'r') westin.kosater

ADD REPLY
3
Entering edit mode
3.1 years ago
Joe 19k

Since you were very close with your python script, I fixed it (you'd done the hard part IMO, which is the RegEx).

import sys
import re

with open(sys.argv[1], 'r') as input:
    for line in input:
        new = re.sub(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', r'\1 \2', line, flags = re.M) #Added a space between backreferences to account for the deleted char
        print(new.rstrip('\n')) # Remove an additional extraneous newline.

So, with this input:

ATOM    109  CA ASER A  48      10.832  19.066  -2.324  0.50 61.96           C
ATOM    121  CA AALA A  49      12.327  22.569  -2.163  0.50 60.22           C
ATOM    131  CA AGLN A  50       8.976  24.342  -1.742  0.50 56.71           C
ATOM    145  CA APRO A  51       7.689  25.565   1.689  0.50 51.89           C
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C

Invoking the script as python myscript.py inputfile.pdb, yields:

ATOM    109  CA  SER A  48      10.832  19.066  -2.324  0.50 61.96           C
ATOM    121  CA  ALA A  49      12.327  22.569  -2.163  0.50 60.22           C
ATOM    131  CA  GLN A  50       8.976  24.342  -1.742  0.50 56.71           C
ATOM    145  CA  PRO A  51       7.689  25.565   1.689  0.50 51.89           C
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C

This will print to screen (which I prefer), and a new file can be made by redirecting the output: python script.py infile.pdb > outfile.pdb. If you want the script to do that instead, make the following changes:

import sys
import re

with open(sys.argv[1], 'r') as input, open(sys.argv[2], 'w') as output:
    for line in input:
        new = re.sub(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', r'\1 \2', line, flags = re.M)
        print(new.rstrip('\n')) # Optionally delete this line
        output.write(new.rstrip('\n')
ADD COMMENT
2
Entering edit mode
3.1 years ago
 awk -F '\t'  '/^ATOM/ {OFS="\t";if(length($4)!=3) $4=substr($4,2);print;next}{print;}' pdb.txt
ADD COMMENT
0
Entering edit mode
3.1 years ago
JC 12k
$ perl -ae 'if (length $F[3] > 3) { $new = $F[3]; $new =~ s/^A/ /; s/$F[3]/$new/; } print' < file.pdb
ATOM    109  CA  SER A  48      10.832  19.066  -2.324  0.50 61.96           C
ATOM    121  CA  ALA A  49      12.327  22.569  -2.163  0.50 60.22           C
ATOM    131  CA  GLN A  50       8.976  24.342  -1.742  0.50 56.71           C
ATOM    145  CA  PRO A  51       7.689  25.565   1.689  0.50 51.89           C
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C
ADD COMMENT
0
Entering edit mode
3.1 years ago
 awk -v OFS="\t" '{if (length($4)==4){sub("^[A-Z]","", $4)}}1' test.txt

or

awk -v OFS="\t" '{if (length($4)==4){$4=substr($4,2,length($4))}}1' test.txt

ATOM    109 CA  SER A   48  10.832  19.066  -2.324  0.50    61.96   C
ATOM    121 CA  ALA A   49  12.327  22.569  -2.163  0.50    60.22   C
ATOM    131 CA  GLN A   50  8.976   24.342  -1.742  0.50    56.71   C
ATOM    145 CA  PRO A   51  7.689   25.565  1.689   0.50    51.89   C
ATOM    158 CA  GLN A   52  5.174   23.336  3.467   1.00    43.45   C   
ATOM    167 CA  HIS A   53  2.339   24.135  5.889   1.00    38.39   C   
ATOM    177 CA  PHE A   54  0.900   22.203  8.827   1.00    33.79   C   
ATOM    188 CA  TYR A   55  -1.217  22.065  11.975  1.00    34.89   C   
ATOM    200 CA  ALA A   56  0.334   20.465  15.090  1.00    31.84   C   
ATOM    205 CA  VAL A   57  0.000   20.066  18.885  1.00    30.46   C   
ATOM    212 CA  VAL A   58  2.738   21.762  20.915  1.00    27.28   C
ADD COMMENT

Login before adding your answer.

Traffic: 2493 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6