PSSMs for entire PDB
2.5 years ago
Simon • 0


For a machine learning algorithm, I would like to have a PSSM (position-specific scoring matrix) profile for every entry in the PDB. I have a couple of questions about this:

  • Are there any resources (e.g. databases) for bulk-downloading PSSMs?
  • Is it reasonable (i.e. sensible) to try to calculate a PSSM for every entry in the PDB using PSI-BLAST?

I'm new to bioinformatics, so all comments, suggestions, and pointers are welcome!

pssm pdb • 989 views
2.5 years ago
Mensur Dlakic ★ 19k

For the purposes of machine learning, sequence alignments and/or hidden Markov models of protein families should be just as good as, and realistically better than, PSSMs. You can find a PDB database clustered at 70% identity here and the explanation is here.

There is no need to model all sequences, and that goes for all protein databases, not just the PDB. As of a month ago, there were almost 0.5 million individual protein chains in the PDB. A simple clustering at 95% identity drops that number to ~60 thousand, meaning that more than 85% of protein chains in the PDB are at least 95% identical to at least one other chain in the database. In other words, there is huge sequence and structure redundancy in the PDB. Depending on your exact task, I think clustering down to 50% identity would work as well, since almost all sequences that share 50% identity are related. Some people will tell you it is safe to go down to 30-40% identity when clustering.
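
To make the redundancy argument concrete, here is a rough Python sketch of how one might pick a single representative per cluster and only build profiles for those. It is an illustration under stated assumptions, not a tested pipeline: the cluster-file layout (one cluster per line, whitespace-separated chain IDs, first ID used as the representative), the chain IDs, the `uniref90` database name, and the `<chain>.fasta` / `<chain>.pssm` file names are all placeholders I made up. The `psiblast` options used are the standard BLAST+ ones (`-query`, `-db`, `-num_iterations`, `-out_ascii_pssm`). With the numbers above, the same arithmetic gives 1 - 60,000/500,000 = 0.88, which is where "more than 85%" comes from.

```python
from typing import Dict, List


def parse_clusters(text: str) -> Dict[str, List[str]]:
    """Map each cluster representative to all members of its cluster.

    Assumed format (hypothetical): one cluster per line, whitespace-separated
    chain IDs, with the first ID on the line taken as the representative.
    """
    clusters: Dict[str, List[str]] = {}
    for line in text.splitlines():
        ids = line.split()
        if ids:  # skip blank lines
            clusters[ids[0]] = ids
    return clusters


def redundancy(clusters: Dict[str, List[str]]) -> float:
    """Fraction of chains that are not cluster representatives."""
    total = sum(len(members) for members in clusters.values())
    return 1.0 - len(clusters) / total if total else 0.0


def psiblast_command(chain_id: str, db: str = "uniref90", iterations: int = 3) -> List[str]:
    """Build (but do not run) a PSI-BLAST command that writes an ASCII PSSM.

    The database name and the file naming scheme are assumptions.
    """
    return [
        "psiblast",
        "-query", f"{chain_id}.fasta",
        "-db", db,
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", f"{chain_id}.pssm",
    ]


# Toy cluster file with made-up chain IDs: three clusters, six chains in total.
toy_cluster_file = """\
1ABC_A 2DEF_A 3GHI_B
4JKL_A
5MNO_A 6PQR_A
"""

toy_clusters = parse_clusters(toy_cluster_file)
print(f"{len(toy_clusters)} representatives, redundancy = {redundancy(toy_clusters):.2f}")

# One PSI-BLAST job per representative instead of one per chain.
for representative in toy_clusters:
    print(" ".join(psiblast_command(representative)))
```

Each command list can be passed to `subprocess.run` once the FASTA files and a formatted BLAST database exist. The resulting PSSM can then be reused for the other members of the same cluster if your task tolerates that approximation.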

