Question

How to compare sets using Python (dealing with PDB files)

9.7 years ago

Jason Lin • 0

Hi all,

Sorry to bother you all again. I have a text file that contains the PDBID and the corresponding missing coordinates from each PDB file, such as:

1FZ2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ5 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ8 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ9 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZH 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

I have another text file that contains the PDBID and the SEG signal (the signal that indicates low-complexity regions in the protein sequence), such as:

1FZ2 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 
1FZ4 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 
1FZ5 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 
1FZ8 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 
1FZ9 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 
1FZH 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354

The numbers in each file are coordinates. I want to compare the two files and generate a file that contains the PDBID and the corresponding coordinates where the SEG signal and the missing coordinates overlap.

In this case I want to generate a file like:

1FZ2 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ4 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ5 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ8 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZ9 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1FZH 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Here is my Python code so far:

    total = []

    fin = open('file1.txt')      # I want to make the missing coordinates file a set called 'a'
    for lines in fin:
        l = lines.split()
        a = set(l[2:])
        print a

    with open('file2.txt') as seg_num:     #  I want to make the SEG signal another set called 'b'
        for seg_signal in seg_num:
            signal = seg_signal.split()
            b = set(signal[1:])
            print("lol" * 10)
            print b
            c = a & b                       # and pick the intersection between a and b called c
            space = ' '
            newlines = '\n'

            total.append([signal[0], space, str(c), newlines])

    with open('file3.txt', 'w') as f:
        for t in total:
            f.write(" ".join(t))

    f.close()

But for some reason it does not give the desired output, and I don't know how to fix it.

PDB python set SEG • 3.4k views


score 2 · Answer 1 · 2014-07-25

This is how I would do it. The IN_PDB file is read into memory as a dictionary keyed on the first column, so the first column must be a unique identifier. The common coordinates are found with the list comprehension [x for x in pdb[k] if x in coords]:

    #!/usr/bin/env python

    IN_PDB = 'pdb.txt'
    IN_SEG = 'seg.txt'
    OUT_PDB = 'outpdb.txt'

    # Read the missing-coordinates file into a dict: ID -> list of coordinates.
    pdb = {}
    with open(IN_PDB) as inpdb:
        for line in inpdb:
            line = line.strip().split()
            pdb[line[0]] = line[1:]

    # For each SEG line, keep the coordinates that also appear for the same ID.
    with open(IN_SEG) as inseg, open(OUT_PDB, 'w') as outsig:
        for line in inseg:
            line = line.strip().split()
            k = line[0]
            coords = line[1:]
            if k in pdb:
                common = [x for x in pdb[k] if x in coords]
                outsig.write(k + '\t' + '\t'.join(common) + '\n')
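
Since the question asks about sets specifically, here is a set-based variant as a rough sketch. In the code from the question, `a` is reassigned on every line of the first loop, so only the set from the last line of the file is still there when the second loop runs; `l[2:]` also skips the first coordinate, and `str(c)` writes the set's repr rather than the numbers. The sketch below keeps one set per ID instead. The function names `parse` and `overlap_lines` are made up for illustration, and it assumes that IDs are unique within each file and that every coordinate is a plain integer.

```python
def parse(lines):
    """Map each ID (first column) to the set of integer coordinates on its line."""
    table = {}
    for line in lines:
        fields = line.split()
        if fields:  # ignore blank lines
            table[fields[0]] = {int(x) for x in fields[1:]}
    return table


def overlap_lines(missing_lines, seg_lines):
    """Return one 'ID c1 c2 ...' string per ID found in both inputs.

    The coordinates are the set intersection, sorted numerically.
    """
    missing = parse(missing_lines)
    seg = parse(seg_lines)
    out = []
    for pdb_id, coords in missing.items():
        if pdb_id in seg:
            common = sorted(coords & seg[pdb_id])
            out.append(" ".join([pdb_id] + [str(c) for c in common]))
    return out
```

With the file names from the question, the two arguments could be the open file objects for file1.txt and file2.txt, and each returned string could be written to file3.txt followed by a newline. Converting to int makes the output sort as 2 3 ... 16 rather than in string order; if the coordinates can carry letters, the int() conversion would have to go.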