Question

Iterating over several PDB files in ProDy module


2.5 years ago

Jonathan Lefebre ▴ 70

Dear all,

I have been using the ProDy module ESSA (http://prody.csb.pitt.edu/tutorials/essa_tutorial/index.html) on single proteins by retrieving them directly from the PDB. This worked fine, as described in the tutorial. Now I would like to run the same calculation on ~200 structures by iterating over downloaded PDB files in a directory. However, the parser that ProDy uses doesn't seem to recognize my downloaded PDB files, and I don't understand why. Here is my code:

# packages
from prody import *
from numpy import *
from matplotlib.pyplot import *
from pandas import *
import os
ion()

directory = '/home/lefebrej95/Documents/PDB_ESSA_test_format'
ext = ('.pdb')
for f in os.listdir(directory):
    if f.endswith(ext):
        fetchPDB(f)
        atoms = parsePDB(f, compressed = True)
        essa = ESSA()
        essa.setSystem(atoms)
        essa.scanResidues()
        with style.context({'figure.dpi': 600}):
            essa.showESSAProfile()

I am getting this error message, no matter what PDB file I am trying to parse:

runfile('/home/lefebrej95/.config/spyder-py3/untitled1.py', wdir='/home/lefebrej95/.config/spyder-py3')

@> WARNING '7l67.pdb' is not a valid identifier.
@> Matching 10 modes across 136 modesets... [ 99%] 1s
@> WARNING '7l62.pdb' is not a valid identifier.
Traceback (most recent call last):

  File "/home/lefebrej95/.config/spyder-py3/untitled1.py", line 35, in <module>
    atoms = parsePDB(f, compressed = True)

  File "/home/lefebrej95/anaconda3/lib/python3.8/site-packages/ProDy-2.0-py3.8-linux-x86_64.egg/prody/proteins/pdbfile.py", line 123, in parsePDB
    return _parsePDB(pdb[0], **kwargs)

  File "/home/lefebrej95/anaconda3/lib/python3.8/site-packages/ProDy-2.0-py3.8-linux-x86_64.egg/prody/proteins/pdbfile.py", line 205, in _parsePDB
    pdb, chain = _getPDBid(pdb)

  File "/home/lefebrej95/anaconda3/lib/python3.8/site-packages/ProDy-2.0-py3.8-linux-x86_64.egg/prody/proteins/pdbfile.py", line 192, in _getPDBid
    raise IOError('{0} is not a valid filename or a valid PDB '

OSError: 7l62.pdb is not a valid filename or a valid PDB identifier.

Does any of you have experience with ProDy, or has anyone run into a similar problem?

Any help is appreciated!

Best regards,

Jonathan

python iteration Prody • 1.4k views

ADD COMMENT • link updated 2.5 years ago by Wayne ★ 2.0k • written 2.5 years ago by Jonathan Lefebre ▴ 70


I don't use this program, but a fetchPDB command would sound like something that is downloading a file rather than opening it from the local disk. If that's the case, it should be no surprise that your file names are not recognized as valid PDB IDs.

Maybe try the same script but skip the fetchPDB(f) line? And maybe set compressed = False in your next line?

ADD REPLY • link 2.5 years ago by Mensur Dlakic ★ 27k


I think the issue with it saying "not a valid filename" is that it isn't. You are forgetting the path. Have you tried changing the fetch command to:

fetchPDB(directory+"/"+f)

And combine that with Mensur Dlakic's suggestion of compressed = False, unless your files are compressed versions?

Another option is to use those files only to extract the PDB code and then continue with the script as-is. The error says "OSError: 7l62.pdb is not a valid filename or a valid PDB identifier.", so did you try changing the fetch command to the following?

fetchPDB(f.split(".pdb")[0])

This will be slower, because you are essentially downloading the files a second time. However, it may not add much time overall, and it gives you options. It may also let things work if the versions of the files in your directory aren't quite what fetchPDB and parsePDB expect. From what you've provided, I cannot tell whether you have compressed versions, full old-style PDB files, or mmCIF versions of the structure files.
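The filename-to-ID step can be sketched a little more defensively than a bare split on ".pdb". This is only an illustration: the helper name pdb_id_from_filename is made up here, and it assumes the files are named after their PDB code, such as 7l62.pdb or 7l62.pdb.gz.

```python
import os


def pdb_id_from_filename(filename):
    """Return the part of a file name that precedes its extension.

    Assumes names like '7l62.pdb' or '7l62.pdb.gz' (hypothetical naming
    convention; adjust if your downloads are named differently).
    """
    name = os.path.basename(filename)      # drop any leading directory
    if name.endswith(".gz"):
        name = name[:-3]                   # drop a trailing '.gz'
    stem, _ext = os.path.splitext(name)    # drop '.pdb', '.ent', '.cif', ...
    return stem


print(pdb_id_from_filename("7l62.pdb"))                 # 7l62
print(pdb_id_from_filename("/some/dir/7l67.pdb.gz"))    # 7l67
```

The returned string is what would be handed to fetchPDB in place of the full file name.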

I should add that I also don't use this program at present; however, it looks like you actually have a Python issue here (assuming your PDB files are in the format the program needs).

ADD REPLY • link 2.5 years ago by Wayne ★ 2.0k


Thanks!

Using

atoms = parsePDB(directory + '/' + f)

in the code solved the problem, and the PDB files are now recognized. I also deleted

fetchPDB()

because it was useless here!
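For anyone landing here later, the corrected approach might look roughly like the sketch below. The function names find_pdb_files and run_essa are made up for illustration; the ProDy calls are the same ones used in the question, and os.path.join replaces the manual '/' concatenation.

```python
import os


def find_pdb_files(directory, ext=".pdb"):
    """Return the full paths of all files in `directory` ending in `ext`."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(ext)
    )


def run_essa(path):
    """Parse one local PDB file and run the ESSA scan on it."""
    # Imported here so that find_pdb_files() works without ProDy installed.
    from prody import ESSA, parsePDB

    atoms = parsePDB(path)   # full path, so ProDy finds the local file
    essa = ESSA()
    essa.setSystem(atoms)
    essa.scanResidues()
    return essa
```

Calling run_essa(path) for each path returned by find_pdb_files(directory), followed by essa.showESSAProfile(), reproduces the loop from the question without fetchPDB.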

Now my only remaining problem is that Python tries to do the calculation on all PDBs in parallel, which kills the computer because of insufficient RAM. I am currently trying to write a workaround that tells Python to load and process the files one by one. If any of you has a smart idea, it would be highly appreciated!

Anyways, thanks a lot for your help!

ADD REPLY • link 2.5 years ago by Jonathan Lefebre ▴ 70


There's probably a setting you can supply when calling the program to limit it to one processor. That would make it process the files serially. Seeing it default to parallel is actually a good sign of how the software was developed; parallelism is harder to add yourself later, when you move to a more powerful system.

I'm not finding the setting easily, though. There's mention of n_cpu=1 here; however, those seem to be different commands from the ones you are using.

An alternative, hacky way to keep it from slamming your system would be to slow things down in your loop. Instead of calling all those tasks back to back, build some waiting time into the loop. Add import time near the top of your script and, at the bottom of the loop, tell it to sleep long enough for one file to be processed at a time. Inside your loop, near the bottom, add something like:

time.sleep(600)

That makes it sleep 600 seconds, or ten minutes, before looping to call the next task. Adjust the number of seconds as you see fit.
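A plain Python for loop already finishes one iteration before starting the next, so another thing worth trying is to release each file's objects and figures before moving on. The sketch below is a guess at what may help, not a confirmed fix: process_one_by_one and essa_to_png are hypothetical names, and it assumes showESSAProfile draws on the current matplotlib figure so that savefig captures it.

```python
import gc


def process_one_by_one(paths, analyse):
    """Call analyse(path) for each path in turn and keep only its return value."""
    results = {}
    for path in paths:
        results[path] = analyse(path)   # returns only when this file is done
        gc.collect()                    # ask Python to free per-file objects now
    return results


def essa_to_png(path):
    """Hypothetical per-file step: run ESSA, save the profile, free the figure."""
    # Lazy imports: this function needs ProDy and matplotlib installed.
    from prody import ESSA, parsePDB
    import matplotlib.pyplot as plt

    essa = ESSA()
    essa.setSystem(parsePDB(path))
    essa.scanResidues()
    essa.showESSAProfile()              # assumed to draw on the current figure
    out = path + "_essa.png"
    plt.savefig(out, dpi=600)
    plt.close("all")                    # open figures otherwise pile up in memory
    return out                          # return a small string, not the ESSA object
```

Usage would be process_one_by_one(list_of_pdb_paths, essa_to_png). Returning only the output file name means the large ESSA objects can be garbage-collected after each file.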

ADD REPLY • link 2.5 years ago by Wayne ★ 2.0k