Question: Trying to use regular expressions to edit a PDB text file.What am I doing wrong?
0
gravatar for westin.kosater
12 months ago by
westin.kosater10 wrote:

I have a text file full of amino acids (CA.txt) as well as some other data. Here is a snippet of the text file

ATOM    109  CA ASER A  48      10.832  19.066  -2.324  0.50 61.96           C  
ATOM    121  CA AALA A  49      12.327  22.569  -2.163  0.50 60.22           C  
ATOM    131  CA AGLN A  50       8.976  24.342  -1.742  0.50 56.71           C  
ATOM    145  CA APRO A  51       7.689  25.565   1.689  0.50 51.89           C  
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C  
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C  
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C  
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C  
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C  
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C  
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C

It is only a 36 KB file. Essentially, my problem is that a few of the amino acids have the letter A in front of them where they are not supposed to be. Amino acid abbreviations are supposed to be 3 letters long. I have attempted to use regular expressions to remove the A at every instance of A in front of an amino acid abbreviation and write the new text to the file "CA-Finale.txt" . Here is my code so far

def Trimmer(txtFileName):
    i = open('CA-Finale.txt', 'w')
    j = open(txtFileName, 'r')
    for record in j:
        with open(txtFileName, 'r') as j:
            content= j.read()
            content_new = re.sub(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', r'\1\2', content, flags = re.M)

            i.write(content_new)
Trimmer('CA.txt')

When I run this, the text file that is generated is 16.7 MB in size, so something has clearly gone wrong. What could it be?

regex biopython python • 494 views
ADD COMMENTlink modified 12 months ago by Joe15k • written 12 months ago by westin.kosater10

At least part of your problem is that you open the file twice...and the first one never gets closed.

ADD REPLYlink modified 12 months ago • written 12 months ago by Joe15k

Did you try removing/commenting out this line: j = open(txtFileName, 'r') westin.kosater

ADD REPLYlink written 12 months ago by cpad011212k
3
gravatar for Joe
12 months ago by
Joe15k
United Kingdom
Joe15k wrote:

Since you were very close with your python script, I fixed it (you'd done the hard part IMO, which is the RegEx).

import sys
import re

with open(sys.argv[1], 'r') as input:
    for line in input:
        new = re.sub(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', r'\1 \2', line, flags = re.M) #Added a space between backreferences to account for the deleted char
        print(new.rstrip('\n')) # Remove an additional extraneous newline.

So, with this input:

ATOM    109  CA ASER A  48      10.832  19.066  -2.324  0.50 61.96           C
ATOM    121  CA AALA A  49      12.327  22.569  -2.163  0.50 60.22           C
ATOM    131  CA AGLN A  50       8.976  24.342  -1.742  0.50 56.71           C
ATOM    145  CA APRO A  51       7.689  25.565   1.689  0.50 51.89           C
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C

Invoking the script as python myscript.py inputfile.pdb, yields:

ATOM    109  CA  SER A  48      10.832  19.066  -2.324  0.50 61.96           C
ATOM    121  CA  ALA A  49      12.327  22.569  -2.163  0.50 60.22           C
ATOM    131  CA  GLN A  50       8.976  24.342  -1.742  0.50 56.71           C
ATOM    145  CA  PRO A  51       7.689  25.565   1.689  0.50 51.89           C
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C

This will print to screen (which I prefer), and a new file can be made by redirecting the output: python script.py infile.pdb > outfile.pdb. If you want the script to do that instead, make the following changes:

import sys
import re

with open(sys.argv[1], 'r') as input, open(sys.argv[2], 'w') as output:
    for line in input:
        new = re.sub(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', r'\1 \2', line, flags = re.M)
        print(new.rstrip('\n')) # Optionally delete this line
        output.write(new.rstrip('\n')
ADD COMMENTlink modified 12 months ago • written 12 months ago by Joe15k
2
gravatar for Pierre Lindenbaum
12 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:
 awk -F '\t'  '/^ATOM/ {OFS="\t";if(length($4)!=3) $4=substr($4,2);print;next}{print;}' pdb.txt
ADD COMMENTlink written 12 months ago by Pierre Lindenbaum124k
0
gravatar for JC
12 months ago by
JC9.1k
Mexico
JC9.1k wrote:
$ perl -ae 'if (length $F[3] > 3) { $new = $F[3]; $new =~ s/^A/ /; s/$F[3]/$new/; } print' < file.pdb
ATOM    109  CA  SER A  48      10.832  19.066  -2.324  0.50 61.96           C
ATOM    121  CA  ALA A  49      12.327  22.569  -2.163  0.50 60.22           C
ATOM    131  CA  GLN A  50       8.976  24.342  -1.742  0.50 56.71           C
ATOM    145  CA  PRO A  51       7.689  25.565   1.689  0.50 51.89           C
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C
ADD COMMENTlink written 12 months ago by JC9.1k
0
gravatar for cpad0112
12 months ago by
cpad011212k
India
cpad011212k wrote:
 awk -v OFS="\t" '{if (length($4)==4){sub("^[A-Z]","", $4)}}1' test.txt

or

awk -v OFS="\t" '{if (length($4)==4){$4=substr($4,2,length($4))}}1' test.txt

ATOM    109 CA  SER A   48  10.832  19.066  -2.324  0.50    61.96   C
ATOM    121 CA  ALA A   49  12.327  22.569  -2.163  0.50    60.22   C
ATOM    131 CA  GLN A   50  8.976   24.342  -1.742  0.50    56.71   C
ATOM    145 CA  PRO A   51  7.689   25.565  1.689   0.50    51.89   C
ATOM    158 CA  GLN A   52  5.174   23.336  3.467   1.00    43.45   C   
ATOM    167 CA  HIS A   53  2.339   24.135  5.889   1.00    38.39   C   
ATOM    177 CA  PHE A   54  0.900   22.203  8.827   1.00    33.79   C   
ATOM    188 CA  TYR A   55  -1.217  22.065  11.975  1.00    34.89   C   
ATOM    200 CA  ALA A   56  0.334   20.465  15.090  1.00    31.84   C   
ATOM    205 CA  VAL A   57  0.000   20.066  18.885  1.00    30.46   C   
ATOM    212 CA  VAL A   58  2.738   21.762  20.915  1.00    27.28   C
ADD COMMENTlink modified 12 months ago • written 12 months ago by cpad011212k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 774 users visited in the last hour