Question

How to get specific information from a list of PDB ID and save all in a spreadsheet?

0

Entering edit mode

2.1 years ago

Welington ▴ 10

Hello everybody

I would like to know if using BioPython or another python library it's possible to get a spreadsheet file (.csv/excel) with certain information that I select like: resolution, R-value, and MAINLY the crystallographic ligand (Biologically Interesting Molecules ). But, being specific, if its possible to do this based on more than one .pdb, that is, a list of PDB's ID. A basic direction will help me immensely, my biggest fear is about the last question. Thank you!

csv biopython pdb • 2.5k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 2.1 years ago by Welington ▴ 10

1

Entering edit mode

Is the API of RCSB enough? https://data.rcsb.org/#data-api

ADD REPLY • link 2.1 years ago by Shred ★ 1.4k

0

Entering edit mode

Thanks. I'm giving a look and in a way it's enough, but I can't find some parameters that I would appear to visualize, like "r-value", "chains", "logp" from one target... I'm in the process thank you very much!

ADD REPLY • link 2.1 years ago by Welington ▴ 10

1

Entering edit mode

2.1 years ago

Mensur Dlakic ★ 27k

PDBFINDER (see here) has a single file that contains all the information you want, and much more. It should be fairly simple to parse it and extract the fields or your interest. A direct link to a text file is below, but beware that some web browsers (Chrome for sure) don't work with FTP addresses, so you may want to do it with wget:

wget ftp://ftp.cmbi.umcn.nl/pub/molbio/data/pdbfinder/PDBFIND.TXT.gz

ADD COMMENT • link 2.1 years ago by Mensur Dlakic ★ 27k

0

Entering edit mode

I'll try to use this method once I'm in possession of my linux machine. Thank you very much since now!

ADD REPLY • link 2.1 years ago by Welington ▴ 10

0

Entering edit mode

One drawback to the PDBFinder database I'm seeing is how often it is updated.
PDBFinder database that link retrieves lists August 17th, 2021 as the second line yet it has no entries added on 2021 as far as I could tell. And I found a bunch where the 'Date' corresponded to 2020, and so I think my search was correct. It's a very minor thing, but I'd also wish they updated their retrieval options to include modern https as the FTP port is blocked on MyBinder. org to limit abuse.

However, you'll see in my answer I stress this is definitely best to use something like that if the OP is planning to scale WAY up.

ADD REPLY • link 2.1 years ago by Wayne ★ 2.0k

1

Entering edit mode

They used to update it weekly if I am not mistaken. At the very least it was updated monthly, but it seems to have changed. Still, this is the most carefully parsed PDB snapshot that I have found.

ADD REPLY • link 2.1 years ago by Mensur Dlakic ★ 27k

score 4 · Accepted Answer · 2022-03-18

"if its possible to do this based on more than one .pdb, that is, a list of PDB's ID."

Probably an example that is close to what you intend would help best illustrate how you could scale this. Go here and work through the notebook that comes up there from top to bottom. You can run individual cells by pressing shift-enter while selecting a cell or by selecting a cell and pressing the play button at the top. The Jupyter session has everything already installed and is temporary.*

I show how you can write a single script to get the information you seek for a single PDB id. (Hopefully, by looking over some examples of various ways I've taken such a script the process one PDB code and then scaled I'll convince you that that this first step is the part that you most have to customize for tasks like this. If you process the data into a standardized data from, the downstream steps are pretty much standard.)

Then you have another script that process many PDB ids by repeating thart process over and over. Since that is really what you want to do, for your case, the second script (or one you customize to do what you want) is what you'd want to run. However, behind-the-scenes you'll know it is actually just running the one that processes a single PDB code over and over (or your version of the one that processed each PDB code) and you have some code guiding the input and the output at each end. Your input here would be a list of PDB id codes.

For output, I made Pandas dataframes. These are a common form of dealing with tabular/panel data in the Python world. The great thing is that these easily convert to tabular text data that can be put into spreadsheet or even converted to an Excel spreadsheet directly. The associated notebook I'll provide guides taking the dataframes produced and making the spreadsheet.

I already had the framework in place and so it was just a matter of adapting the two scripts and the notebook. (That being said, you'll want to start small and learn a little at a time. Or reverse-engineer and adapt a little at a time in your case. You'll note how I try to keep things pretty modular or progressing in steps. That makes it easier to build and recombine and adapt later.)

This is really only meant to illustrate the scaling and going to tabular text or spreadsheet form. You'll want to customize it.
In fact, I would encourage you to adapt the code to process the data from the the PDBFinder database suggested if it is possible. It will be much faster if you plan to run a lot of these. I built a delay into my demo because I didn't want to chance slamming the PDBsum site too often if someone took my code and used it with a gigantic list of PDB codes. With everything present on the same system drive already, you could feel free to run it as fast as can be done. So you'd want to delete time.spleep(0.3) from the 'MAKE DATAFRAMES FOR EACH PDB CODE:' section in the second script if you use it as a basis to adapt. Plus that data is much easier to parse. You'll see I do some acrobatic parsing in mine. (I sort of gave an intro to PDBsum in the bottom three paragraphs here. PDBsum's 1eve 'Top page' is what I used in the start of the demo.)

The cycle of taking a dataframe and getting a delimited text table file or a spreadsheet file is fairly straightforward once you see the process spelled out. The only thing that really needs to be changed are the specific file names. One of the few complexities that can get added is styling the dataframe so that the styling shows up in the Excel spreadsheet result, which I won't cover here. I have other notebooks I can point you to where I use it, if interested. Illustrating the straightforward nature of this, is that many of the notebooks in the series offered in my pdbsum-binder repo end with pretty much the same steps of going from dataframe to a tabular text data form that is more universally useable by folks not versed in Python. You can more easily get to the notebooks in the series by going here and pressing the launch binder badge. The page that comes up first lists the available notebooks, including the one I added to illustrate this.

Of course, PDBsum has drawbacks too. One I noted doing this was that the entries for very large structures, such as the ribosome, are very rudimentary with next to no information. Not even the reported resolution of the crystal structure. Because I already had the framework in place using PDBsum, that was why I am using it for this demo.

*FOOTNOTE: When you get around to make anything useful in sessions served by MyBinder, make sure you download it immediately because it will time out after 10 minutes of inactivity, or very rarely it will go poof with little chance of getting your stuff off the remote compute when something goes haywire in the long chain of tech backing this. There is actually a safety net, but it is best to review how it works ahead of time before you need it, see here.