Question: Trying to use regular expressions to edit a PDB text file.What am I doing wrong?
0
gravatar for westin.kosater
4 months ago by
westin.kosater10 wrote:

I have a text file full of amino acids (CA.txt) as well as some other data. Here is a snippet of the text file

ATOM    109  CA ASER A  48      10.832  19.066  -2.324  0.50 61.96           C  
ATOM    121  CA AALA A  49      12.327  22.569  -2.163  0.50 60.22           C  
ATOM    131  CA AGLN A  50       8.976  24.342  -1.742  0.50 56.71           C  
ATOM    145  CA APRO A  51       7.689  25.565   1.689  0.50 51.89           C  
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C  
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C  
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C  
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C  
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C  
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C  
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C

It is only a 36 KB file. Essentially, my problem is that a few of the amino acids have the letter A in front of them where they are not supposed to be. Amino acid abbreviations are supposed to be 3 letters long. I have attempted to use regular expressions to remove the A at every instance of A in front of an amino acid abbreviation and write the new text to the file "CA-Finale.txt" . Here is my code so far

def Trimmer(txtFileName):
    i = open('CA-Finale.txt', 'w')
    j = open(txtFileName, 'r')
    for record in j:
        with open(txtFileName, 'r') as j:
            content= j.read()
            content_new = re.sub(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', r'\1\2', content, flags = re.M)

            i.write(content_new)
Trimmer('CA.txt')

When I run this, the text file that is generated is 16.7 MB in size, so something has clearly gone wrong. What could it be?

regex biopython python • 282 views
ADD COMMENTlink modified 4 months ago by jrj.healey11k • written 4 months ago by westin.kosater10

At least part of your problem is that you open the file twice...and the first one never gets closed.

ADD REPLYlink modified 4 months ago • written 4 months ago by jrj.healey11k

Did you try removing/commenting out this line: j = open(txtFileName, 'r') westin.kosater

ADD REPLYlink written 4 months ago by cpad011211k
3
gravatar for jrj.healey
4 months ago by
jrj.healey11k
United Kingdom
jrj.healey11k wrote:

Since you were very close with your python script, I fixed it (you'd done the hard part IMO, which is the RegEx).

import sys
import re

with open(sys.argv[1], 'r') as input:
    for line in input:
        new = re.sub(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', r'\1 \2', line, flags = re.M) #Added a space between backreferences to account for the deleted char
        print(new.rstrip('\n')) # Remove an additional extraneous newline.

So, with this input:

ATOM    109  CA ASER A  48      10.832  19.066  -2.324  0.50 61.96           C
ATOM    121  CA AALA A  49      12.327  22.569  -2.163  0.50 60.22           C
ATOM    131  CA AGLN A  50       8.976  24.342  -1.742  0.50 56.71           C
ATOM    145  CA APRO A  51       7.689  25.565   1.689  0.50 51.89           C
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C

Invoking the script as python myscript.py inputfile.pdb, yields:

ATOM    109  CA  SER A  48      10.832  19.066  -2.324  0.50 61.96           C
ATOM    121  CA  ALA A  49      12.327  22.569  -2.163  0.50 60.22           C
ATOM    131  CA  GLN A  50       8.976  24.342  -1.742  0.50 56.71           C
ATOM    145  CA  PRO A  51       7.689  25.565   1.689  0.50 51.89           C
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C

This will print to screen (which I prefer), and a new file can be made by redirecting the output: python script.py infile.pdb > outfile.pdb. If you want the script to do that instead, make the following changes:

import sys
import re

with open(sys.argv[1], 'r') as input, open(sys.argv[2], 'w') as output:
    for line in input:
        new = re.sub(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', r'\1 \2', line, flags = re.M)
        print(new.rstrip('\n')) # Optionally delete this line
        output.write(new.rstrip('\n')
ADD COMMENTlink modified 4 months ago • written 4 months ago by jrj.healey11k
2
gravatar for Pierre Lindenbaum
4 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:
 awk -F '\t'  '/^ATOM/ {OFS="\t";if(length($4)!=3) $4=substr($4,2);print;next}{print;}' pdb.txt
ADD COMMENTlink written 4 months ago by Pierre Lindenbaum118k
0
gravatar for JC
4 months ago by
JC7.6k
Mexico
JC7.6k wrote:
$ perl -ae 'if (length $F[3] > 3) { $new = $F[3]; $new =~ s/^A/ /; s/$F[3]/$new/; } print' < file.pdb
ATOM    109  CA  SER A  48      10.832  19.066  -2.324  0.50 61.96           C
ATOM    121  CA  ALA A  49      12.327  22.569  -2.163  0.50 60.22           C
ATOM    131  CA  GLN A  50       8.976  24.342  -1.742  0.50 56.71           C
ATOM    145  CA  PRO A  51       7.689  25.565   1.689  0.50 51.89           C
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C
ADD COMMENTlink written 4 months ago by JC7.6k
0
gravatar for cpad0112
4 months ago by
cpad011211k
India
cpad011211k wrote:
 awk -v OFS="\t" '{if (length($4)==4){sub("^[A-Z]","", $4)}}1' test.txt

or

awk -v OFS="\t" '{if (length($4)==4){$4=substr($4,2,length($4))}}1' test.txt

ATOM    109 CA  SER A   48  10.832  19.066  -2.324  0.50    61.96   C
ATOM    121 CA  ALA A   49  12.327  22.569  -2.163  0.50    60.22   C
ATOM    131 CA  GLN A   50  8.976   24.342  -1.742  0.50    56.71   C
ATOM    145 CA  PRO A   51  7.689   25.565  1.689   0.50    51.89   C
ATOM    158 CA  GLN A   52  5.174   23.336  3.467   1.00    43.45   C   
ATOM    167 CA  HIS A   53  2.339   24.135  5.889   1.00    38.39   C   
ATOM    177 CA  PHE A   54  0.900   22.203  8.827   1.00    33.79   C   
ATOM    188 CA  TYR A   55  -1.217  22.065  11.975  1.00    34.89   C   
ATOM    200 CA  ALA A   56  0.334   20.465  15.090  1.00    31.84   C   
ATOM    205 CA  VAL A   57  0.000   20.066  18.885  1.00    30.46   C   
ATOM    212 CA  VAL A   58  2.738   21.762  20.915  1.00    27.28   C
ADD COMMENTlink modified 4 months ago • written 4 months ago by cpad011211k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1162 users visited in the last hour