Question: PSSMs for entire PDB
6 months ago, Simon wrote:


For a machine learning algorithm, I would like to have a PSSM (position-specific scoring matrix) profile for every entry in the PDB. I have a couple of questions about this:

  • Are there any resources (e.g. databases) for bulk downloading PSSMs?
  • Is it reasonable (i.e. sensible) to try to calculate a PSSM for every entry in the PDB using PSI-BLAST?

I'm new to bioinformatics, so all comments, suggestions, and pointers are welcome!

6 months ago, Mensur Dlakic wrote:

For the purposes of machine learning, sequence alignments and/or hidden Markov models of protein families should be just as good as PSSMs, and realistically better. You can find a PDB database clustered at 70% identity here, and the explanation is here.

There is no need to model all sequences, and that goes for all protein databases, not just PDB. As of a month ago, there were almost 0.5 million individual protein chains in PDB. Simple clustering at 95% identity drops that number to ~60 thousand, meaning that more than 85% of protein chains in PDB are at least 95% identical to some other chain in the database. In other words, there is huge sequence and structure redundancy in PDB. Depending on your exact task, I think clustering down to 50% identity would work as well, since almost all sequences that share 50% identity are related. Some people will tell you it is safe to go down to 30-40% identity when clustering.
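
To make the redundancy numbers concrete, here is a minimal Python sketch. It is only an illustration: the function names and the toy sequences are made up, the identity measure assumes the sequences are already aligned and of equal length, and real clustering is done with dedicated tools such as CD-HIT or MMseqs2 rather than anything like this.

```python
def redundancy_fraction(total_chains, n_clusters):
    """Fraction of chains that are not needed as cluster representatives."""
    return 1 - n_clusters / total_chains


def identity(a, b):
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Toy greedy clustering: each sequence joins the first representative
    it matches at >= threshold identity, otherwise it starts a new cluster."""
    clusters = {}  # representative name -> list of member names
    for name, seq in seqs.items():
        for rep in clusters:
            if identity(seqs[rep], seq) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


if __name__ == "__main__":
    # Approximate numbers quoted above: ~0.5 million chains, ~60 thousand clusters.
    print(f"{redundancy_fraction(500_000, 60_000):.0%} of chains are redundant")

    # Hypothetical toy sequences, not real PDB chains.
    toy = {
        "chain_A": "MKTAYIAKQR",
        "chain_B": "MKTAYIAKQK",  # differs from chain_A at one position
        "chain_C": "GSHMLEDPVA",  # unrelated to the other two
    }
    print(greedy_cluster(toy, threshold=0.9))
```

With the rough figures above, about 88% of chains fall into a cluster represented by another chain, so whatever profiles you build (PSSMs, alignments, or HMMs) only need to be computed for the representatives.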
