Question

Create a custom sequence database for HHblits from the PDB

2.5 years ago

sizeineb • 0

Hello,

I am interested in creating a custom sequence database from the PDB for use with HHblits. Since I am not familiar with this tool, I have some questions:

1- The custom sequence database should contain only sequences of proteins for which a 3D structure is available. The HH-suite user guide uses rsync to fetch the structure files. However, that command will download all entries in the PDB, not only the ones corresponding to protein structures but also the ones containing only nucleic acids, right? If this is the case, is there a similarly easy way to download only the protein files in CIF format?

2- The next step is to generate the sequences of the proteins. The user guide uses cif2fasta.py for this. Since proteins in the PDB may contain mutations and missing parts, is there a way to obtain the FASTA sequences of the downloaded proteins as they appear in the UniProt database?

Many thanks in advance for your help.

HH-suite PDB MSA FASTA hhblit • 1.6k views

score 2 · Accepted Answer · 2021-11-23

First of all, I strongly recommend that you not do what you are planning. I have been building HHblits-like databases of PDB structures on a monthly basis since 2005. Back then there were other tools to gather and align members, but eventually I switched the whole thing to HHblits. This database has over 100,000 HMMs and gets 300-400 new members each month. Even a monthly update is a fairly large undertaking that requires a lot of computer time and a fairly large amount of RAM. I can't imagine doing it from scratch on anything smaller than a super-cluster, and it would still take many months. Besides, HHsuite already has such a database based on PDB structures and clustered at 70% identity:

http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/

The latest version is from Nov 17th, 2021, so it isn't even a week old. Please don't take offense, but I can't imagine that you would do a better job at it than the HHsuite authors, or that you can dedicate more resources to it than they already do.

If you still want to go through with this - again, I don't think you should - you may want to consider a different order of steps. As to your question #1, I don't think you need to download the whole PDB database. You would be looking at ~180,000 files that are protein structures, and there is huge redundancy among them. There are ways to download all protein sequences of PDB entries without downloading the structures:

https://ftp.wwpdb.org/pub/pdb/derived_data/

You want the file pdb_seqres.txt. Once you download it, I suggest you remove the redundancy at a sequence level before doing anything with structures. When that is done, it will give you only a relatively small number of structures to download and process. Keep in mind that this is very relative, because tens of thousands of structures is still a large number.
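As a rough illustration of that first pass, here is a minimal Python sketch that reads pdb_seqres.txt, keeps only chains tagged `mol:protein`, and collapses exactly identical sequences. It assumes the usual layout of that file (a `>` header line such as `>101m_A mol:protein length:154  MYOGLOBIN` followed by a single sequence line). The function names and the `min_length` parameter are made up for this example. Note that collapsing exact duplicates is only the crudest form of redundancy removal; clustering at an identity threshold such as 70% would need a dedicated clustering tool on top of this.

```python
def parse_seqres(lines):
    """Yield (chain_id, mol_type, sequence) from pdb_seqres.txt-style lines.

    Assumes each record is a '>' header line followed by one sequence line.
    """
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        elif header is not None:
            fields = header.split()
            chain_id = fields[0]
            # The molecule type is stored as a 'mol:<type>' token in the header.
            mol_type = next(
                (f.split(":", 1)[1] for f in fields[1:] if f.startswith("mol:")),
                "",
            )
            yield chain_id, mol_type, line
            header = None


def unique_protein_sequences(lines, min_length=1):
    """Group protein chains by identical sequence.

    Returns a dict mapping each distinct protein sequence to the list of
    chain IDs that carry it, in order of first appearance.
    min_length is a hypothetical knob for dropping very short peptides.
    """
    groups = {}
    for chain_id, mol_type, seq in parse_seqres(lines):
        if mol_type != "protein" or len(seq) < min_length:
            continue
        groups.setdefault(seq, []).append(chain_id)
    return groups
```

Calling `unique_protein_sequences(open("pdb_seqres.txt"))` on the downloaded file would give one entry per distinct protein sequence, and the first chain ID in each list can serve as the representative structure to fetch later.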

As to your question #2, PDB structures in most cases contain links to UniProt accession numbers, though I don't know of a ready-made automatic way to extract them. If you look at my favorite structure, you will see after scrolling down that this structure corresponds to this UniProt entry. That information is likely to be present in both PDB and CIF files, and is simply a matter of parsing it out once you settle on a reasonable number of structures. My question to you is: why would you want to ignore the mutants and link them to non-mutated UniProt entries? What ultimately matters is the protein sequence in the structure itself, because that is the only thing that can be used for modeling.
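For what the parsing could look like, here is a minimal sketch that pulls UniProt accessions out of the `DBREF` records of a legacy PDB-format file, using the fixed column positions of that record type. The function names are invented for this example, and it deliberately ignores the `DBREF1`/`DBREF2` record pairs used for long accessions. In mmCIF files the equivalent information lives in the `_struct_ref` category (`_struct_ref.db_name` and `_struct_ref.pdbx_db_accession`), which is better read with a proper mmCIF parser than with string slicing.

```python
def parse_dbref_line(line):
    """Return (pdb_id, chain, database, accession) for a DBREF record.

    Returns None for any other line, including DBREF1/DBREF2 records.
    Uses the fixed columns of the PDB format: idCode 8-11, chainID 13,
    database 27-32, dbAccession 34-41 (1-based).
    """
    if not line.startswith("DBREF "):
        return None
    line = line.rstrip("\n").ljust(41)  # pad so slicing never falls short
    return (
        line[7:11].strip(),
        line[12],
        line[26:32].strip(),
        line[33:41].strip(),
    )


def uniprot_accessions(pdb_lines):
    """Map (pdb_id, chain) to the list of UniProt accessions it references."""
    mapping = {}
    for line in pdb_lines:
        record = parse_dbref_line(line)
        if record is None:
            continue
        pdb_id, chain, database, accession = record
        # 'UNP' is the database code used for UniProt in DBREF records.
        if database == "UNP" and accession:
            mapping.setdefault((pdb_id, chain), []).append(accession)
    return mapping
```

A chain can map to more than one accession (chimeric constructs, for instance), which is why the values are lists rather than single strings.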