How to compare sets using Python (dealing with PDB files)
6.7 years ago
Jason Lin • 0

Hi all,

Sorry to bother you all again. I have a text file which contains PDB IDs and the corresponding missing coordinates from each PDB file, such as:

1FZ2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ5 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ8 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ9 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZH 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

I also have another text file which contains the PDB ID and the SEG signal (the signal that indicates low-complexity regions in the protein sequence), such as:

1FZ2 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354
1FZ4 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354
1FZ5 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354
1FZ8 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354
1FZ9 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354
1FZH 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354

The numbers in each file are coordinates. I want to compare the two files and generate a file which contains the PDB ID, of course, and the coordinates that overlap between the SEG signal and the missing coordinates.

In this case I want to generate a file like:

1FZ2 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ4 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ5 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ8 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ9 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZH 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Here is my Python code so far:

    total = []

    fin = open('file1.txt')      # I want to make the missing coordinates file a set called 'a'
    for lines in fin:
        l = lines.split()
        a = set(l[2:])
        print(a)

    with open('file2.txt') as seg_num:     # I want to make the SEG signal another set called 'b'
        for seg_signal in seg_num:
            signal = seg_signal.split()
            b = set(signal[1:])
            print("lol" * 10)
            print(b)
            c = a & b                       # and pick the intersection between a and b called c
            space = ' '
            newlines = '\n'

            total.append([signal[0], space, str(c), newlines])

    with open('file3.txt', 'w') as f:
        for t in total:
            f.write(" ".join(t))

    f.close()

But for some reason it does not give the desired output, and I don't know how to fix it.

6.7 years ago

This is how I would do it. The IN_PDB file is read into memory as a dictionary keyed on the first column, so the first column must be a unique identifier. The common coordinates are found with the list comprehension [x for x in pdb[k] if x in coords]:

    #!/usr/bin/env python

    IN_PDB = 'pdb.txt'
    IN_SEG = 'seg.txt'
    OUT_PDB = 'outpdb.txt'

    # Read the missing-coordinates file into a dict: PDB ID -> list of coordinates
    inpdb = open(IN_PDB)
    pdb = {}
    for line in inpdb:
        line = line.strip().split()
        pdb[line[0]] = line[1:]
    inpdb.close()

    # For each SEG line, write the coordinates shared with the same PDB ID
    outsig = open(OUT_PDB, 'w')
    inseg = open(IN_SEG)
    for line in inseg:
        line = line.strip().split()
        k = line[0]
        coords = line[1:]
        if k in pdb:
            common = [x for x in pdb[k] if x in coords]
            outsig.write(k + '\t' + '\t'.join(common) + '\n')
    outsig.close()
    inseg.close()
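
Since the question asks about sets, here is a minimal sketch of the same idea using set intersection. In the code from the question, `a` is reassigned on every line of file1.txt, so only the last line's set is still there when file2.txt is read; `l[2:]` also skips the first coordinate, and `str(c)` writes the set's printed form rather than the bare numbers. The function names `read_coords` and `write_overlap` are made up for this example, and it assumes every coordinate is a plain integer (no insertion codes) and that each PDB ID appears once per file.

```python
def read_coords(path):
    """Map each PDB ID (first column) to the set of integer coordinates on its line."""
    table = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:  # skip blank lines
                continue
            table[fields[0]] = set(int(x) for x in fields[1:])
    return table


def write_overlap(missing_path, seg_path, out_path):
    """Write, for each PDB ID found in both files, the sorted shared coordinates."""
    missing = read_coords(missing_path)
    with open(seg_path) as seg, open(out_path, 'w') as out:
        for line in seg:
            fields = line.split()
            # Skip blank lines and IDs that have no missing-coordinates entry
            if not fields or fields[0] not in missing:
                continue
            seg_coords = set(int(x) for x in fields[1:])
            # Set intersection; sorted numerically so 2 comes before 10
            common = sorted(missing[fields[0]] & seg_coords)
            out.write(' '.join([fields[0]] + [str(x) for x in common]) + '\n')
```

Calling `write_overlap('file1.txt', 'file2.txt', 'file3.txt')` would write one space-separated line per shared PDB ID. Converting to integers before sorting matters because sets are unordered and sorting the raw strings would put "10" before "2".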