How to get specific information for a list of PDB IDs and save it all in a spreadsheet?
6 months ago
Welington ▴ 10

Hello everybody. I would like to know whether BioPython or another Python library can produce a spreadsheet file (.csv/Excel) with information that I select, such as resolution, R-value, and MAINLY the crystallographic ligand (Biologically Interesting Molecules). Specifically, is it possible to do this for more than one .pdb, that is, for a list of PDB IDs? A basic direction would help me immensely; my biggest concern is the last question. Thank you!

spreadsheet .csv data excel biopython pdb big • 1.1k views

Is the API of RCSB enough? https://data.rcsb.org/#data-api

Thanks. I'm taking a look, and in a way it's enough, but I can't find some parameters that I would like to see, such as "r-value", "chains", and "logp" for a given target... I'm working on it, thank you very much!
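For anyone following along, here is a minimal sketch of pulling a few of those values out of an entry record from the RCSB Data API. The endpoint path and the JSON field names (`rcsb_entry_info.resolution_combined`, `rcsb_entry_info.nonpolymer_bound_components`, `refine[].ls_R_factor_R_work`, `refine[].ls_R_factor_R_free`) are assumptions that should be checked against the API's published schema, and the function names are made up for illustration.

```python
import json
import urllib.request

# Assumed REST endpoint of the RCSB Data API for a single entry.
ENTRY_URL = "https://data.rcsb.org/rest/v1/core/entry/{pdb_id}"


def fetch_entry(pdb_id):
    """Download the entry-level JSON record for one PDB ID (needs network access)."""
    with urllib.request.urlopen(ENTRY_URL.format(pdb_id=pdb_id)) as response:
        return json.load(response)


def summarize_entry(pdb_id, entry):
    """Pick a few fields out of an entry record, using None when one is absent."""
    # All key names below are assumptions about the JSON layout.
    info = entry.get("rcsb_entry_info") or {}
    refine = (entry.get("refine") or [{}])[0]
    resolution = info.get("resolution_combined") or [None]
    ligands = info.get("nonpolymer_bound_components") or []
    return {
        "pdb_id": pdb_id.upper(),
        "resolution": resolution[0],
        "r_work": refine.get("ls_R_factor_R_work"),
        "r_free": refine.get("ls_R_factor_R_free"),
        "ligands": ";".join(ligands) if ligands else None,
    }
```

Calling `summarize_entry(pdb_id, fetch_entry(pdb_id))` gives one flat dictionary per entry, which is a convenient shape for building a table later. Missing fields come back as None instead of raising an error.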

6 months ago
Wayne ★ 1.4k

"if its possible to do this based on more than one .pdb, that is, a list of PDB's ID."

An example close to what you intend would probably best illustrate how you could scale this. Go here and work through the notebook that comes up, from top to bottom. You can run individual cells by pressing shift-enter while a cell is selected, or by selecting a cell and pressing the play button at the top. The Jupyter session has everything already installed and is temporary.*

I show how you can write a single script to get the information you seek for a single PDB id. (Hopefully, by looking over some examples of the various ways I've taken a script that processes one PDB code and then scaled it, I'll convince you that this first step is the part you most have to customize for tasks like this. If you process the data into a standardized data form, the downstream steps are pretty much standard.)

Then you have another script that processes many PDB ids by repeating that process over and over. Since that is really what you want to do, the second script (or one you customize to do what you want) is the one you'd want to run. However, you'll know that behind the scenes it is just running the script that processes a single PDB code over and over (or your version of it), with some code guiding the input and the output at each end. Your input here would be a list of PDB id codes.
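The shape of that second step can be sketched roughly as follows. This is not the script from the notebook; `collect_rows`, `read_pdb_ids`, and the `process_one` callback are hypothetical names standing in for whatever function handles a single PDB code.

```python
import sys
import time


def read_pdb_ids(text):
    """Split a block of text into lower-case PDB IDs, one per whitespace-separated token."""
    return [token.strip().lower() for token in text.split()]


def collect_rows(pdb_ids, process_one, delay=0.3):
    """Run process_one on every PDB ID and gather the results.

    Returns the list of results plus the list of IDs that raised an error.
    """
    rows, failed = [], []
    for pdb_id in pdb_ids:
        try:
            rows.append(process_one(pdb_id))
        except Exception as err:  # one bad entry should not stop the whole batch
            sys.stderr.write(f"skipping {pdb_id}: {err}\n")
            failed.append(pdb_id)
        time.sleep(delay)  # pause between requests to go easy on a remote server
    return rows, failed
```

Keeping the per-ID function as a parameter means the loop never needs to change when the single-ID step is rewritten for a different data source. Collecting the failed IDs separately makes it easy to see afterwards which entries need special handling.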

For output, I made Pandas dataframes. These are a common way of dealing with tabular/panel data in the Python world. The great thing is that they convert easily to tabular text data that can be put into a spreadsheet, or can even be converted to an Excel spreadsheet directly. The associated notebook I'll provide walks through taking the dataframes produced and making the spreadsheet.
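A minimal sketch of that last step, assuming each PDB code has already been turned into a one-row dataframe. The function names and the column layout are made up for illustration and are not the ones in the notebook.

```python
import pandas as pd


def combine_frames(frames):
    """Stack per-ID dataframes into one table with a fresh 0..n-1 index."""
    return pd.concat(frames, ignore_index=True)


def frame_to_csv_text(df):
    """Return the table as comma-separated text; pass a file path to to_csv to save it instead."""
    return df.to_csv(index=False)
```

To write an Excel file instead, `df.to_excel("results.xlsx", index=False)` does the equivalent job, provided an Excel writer package such as openpyxl is installed. The file name here is only a placeholder.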

I already had the framework in place and so it was just a matter of adapting the two scripts and the notebook. (That being said, you'll want to start small and learn a little at a time. Or reverse-engineer and adapt a little at a time in your case. You'll note how I try to keep things pretty modular or progressing in steps. That makes it easier to build and recombine and adapt later.)

This is really only meant to illustrate the scaling and going to tabular text or spreadsheet form. You'll want to customize it.
In fact, I would encourage you to adapt the code to process the data from the PDBFinder database suggested, if that is possible. It will be much faster if you plan to run a lot of these. I built a delay into my demo because I didn't want to risk hitting the PDBsum site too often if someone took my code and used it with a gigantic list of PDB codes. With everything already present on the same system drive, you could feel free to run it as fast as it will go. So you'd want to delete time.sleep(0.3) from the 'MAKE DATAFRAMES FOR EACH PDB CODE:' section in the second script if you use it as a basis to adapt. Plus, that data is much easier to parse. You'll see I do some acrobatic parsing in mine. (I sort of gave an intro to PDBsum in the bottom three paragraphs here. PDBsum's 1eve 'Top page' is what I used at the start of the demo.)

The cycle of taking a dataframe and getting a delimited text table file or a spreadsheet file is fairly straightforward once you see the process spelled out. The only things that really need to be changed are the specific file names. One of the few complexities that can be added is styling the dataframe so that the styling shows up in the resulting Excel spreadsheet, which I won't cover here. I have other notebooks I can point you to where I use it, if you are interested. To illustrate how straightforward this is, many of the notebooks in the series offered in my pdbsum-binder repo end with pretty much the same steps of going from a dataframe to a tabular text form that is more universally usable by folks not versed in Python. You can more easily get to the notebooks in the series by going here and pressing the launch binder badge. The page that comes up first lists the available notebooks, including the one I added to illustrate this.

Of course, PDBsum has drawbacks too. One I noticed while doing this was that the entries for very large structures, such as the ribosome, are very rudimentary, with next to no information, not even the reported resolution of the crystal structure. I am using PDBsum for this demo because I already had the framework in place for it.

*FOOTNOTE: When you get around to making anything useful in sessions served by MyBinder, make sure you download it immediately, because the session will time out after 10 minutes of inactivity. Very rarely, it will also go poof when something goes haywire in the long chain of tech backing this, with little chance of getting your stuff off the remote machine. There is actually a safety net, but it is best to review how it works before you need it; see here.

Wow, nice work on your notebooks; they are quite complete. This was what I needed. But when I try to run another list of PDB IDs at In [21] of the notebook, I get this error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~/notebooks/pdb_ids_to_stats_and_info_df.py in <module>
    372
    373
--> 374 main()
    375
    376 #**

~/notebooks/pdb_ids_to_stats_and_info_df.py in main()
    319 # calling script from command line
    320 pdb_ids_to_stats_and_info_df(
--> 321 file_name,kwargs)
    322 # using https://www.saltycrane.com/blog/2008/01/how-to-use-args-and-kwargs-in-python/#calling-a-function
    323 # to build keyword arguments to pass to the function above

~/notebooks/pdb_ids_to_stats_and_info_df.py in pdb_ids_to_stats_and_info_df(file_name, return_df, pickle_df)
    251 for pdb_code in pdb_codes_df.pdb_ids:
    252 dfs.append(pdbsum_stats_and_info_adpated_example(pdb_code,
--> 253 pickle_df=False))
    254 sys.stderr.write("\n")
    255 time.sleep(0.3)

~/notebooks/pdbsum_stats_and_info_adpated_example.py in pdbsum_stats_and_info_adpated_example(pdb_code, return_df, pickle_df)
    279 Adapted from the main function in blast_to_df.py
    280 '''
--> 281 df = get_protein_statsningo_table(pdb_code)
    282
    283

~/notebooks/pdbsum_stats_and_info_adpated_example.py in get_protein_statsningo_table(pdb_code)
    208 #print(r_value)
    209 ligands_section_exp = raw_txt.split(
--> 210 'Ligands',1)[1].split("</table>",3)
    211 # HTML table to text adapted from https://www.geeksforgeeks.org/convert-html-table-into-csv-file-in-python/
    212 ligands_section = "</table>".join(ligands_section_exp[:2])

IndexError: list index out of range

I tried this list:

pdb_ids_each_online='''7ALI
7ALH
1P9S'''
%store pdb_ids_each_online >pdb_ids.txt

Yes, sorry. I should have pointed out that the step #1 script wasn't fully refined; however, my post was quite long as it was.
I hit that error a few times while testing with various PDB ids. It happens because both 7ali and 7alh lack ligands. I'll try to find some time soon to adjust the script to handle that gracefully and put 'None' in the 'Ligands' column of the resulting dataframe for such cases. But I thought that, at this point, seeing how to iterate over multiple ids and make tabular data from the collected information was more helpful to you than making the script that handles each PDB id 100% production-ready.
There's another whole category of experimentally determined structures that I suspect would also cause an issue right now.
Your encountering these types of issues actually illustrates what I was trying to say: the challenge is usually working out the code/script that handles the individual items, in this case PDB codes. If you get that core unit working so that it standardizes the data coming out of that step, the code/script that iterates over many items and gets the results into tabular data generally takes the same form for each task like this, counter to the fear you expressed about that aspect in your original post. And if you keep the parts that do the iterating and handle the final output as modular as possible, you can recombine and adapt them more easily the next time you need them. (Keep in mind that keeping things general and modular is easier said than done when learning to code. That part usually comes with experience, but it helps to think about it early on.)

I updated the script and associated notebook just now. The changes are all minor, except that cases without ligands should now be handled much more gracefully. For example, the PDB identifier 7ali shouldn't throw a list index out of range error anymore, and the ligands column should now have 'None' in the corresponding row of the dataframe produced.
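The general idea of such a guard can be sketched like this. It is an illustration built around the two lines shown in the traceback, not the code in the updated script, and `extract_ligands_section` is a hypothetical name.

```python
def extract_ligands_section(raw_txt, marker="Ligands"):
    """Return the HTML chunk that follows the marker, or None when the marker is absent."""
    # partition never raises: when the marker is missing, `found` is an empty string.
    _, found, after = raw_txt.partition(marker)
    if not found:
        return None
    # Keep the text up to the second closing table tag, mirroring the traceback lines.
    pieces = after.split("</table>", 3)
    return "</table>".join(pieces[:2])
```

The caller can then store the string 'None' in the ligands column whenever the function returns None, so an entry without ligands produces a normal row instead of an IndexError.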

I saw, and it's working perfectly! Thanks for your patience and the clear teaching! :D I learned a lot from this thread :D.

6 months ago
Mensur Dlakic ★ 20k

PDBFINDER (see here) has a single file that contains all the information you want, and much more. It should be fairly simple to parse it and extract the fields of interest to you. A direct link to the text file is below, but beware that some web browsers (Chrome for sure) don't work with FTP addresses, so you may want to fetch it with wget:

wget ftp://ftp.cmbi.umcn.nl/pub/molbio/data/pdbfinder/PDBFIND.TXT.gz

I'll try this method once I have access to my Linux machine. Thank you very much in advance!

One drawback I'm seeing with the PDBFinder database is how often it is updated.
The PDBFinder database file that the link retrieves lists August 17th, 2021 on its second line, yet it has no entries added in 2021 as far as I could tell. I did find a bunch where the 'Date' corresponded to 2020, so I think my search was correct. It's a very minor thing, but I also wish they would update their retrieval options to include modern https, as the FTP port is blocked on MyBinder.org to limit abuse.

However, you'll see that in my answer I stress it is definitely best to use something like that if the OP is planning to scale WAY up.

They used to update it weekly if I am not mistaken. At the very least it was updated monthly, but it seems to have changed. Still, this is the most carefully parsed PDB snapshot that I have found.