Question

PSSMs for entire PDB

4.3 years ago

Simon • 0

Hello!

For a machine learning algorithm I would like to have a PSSM (position-specific scoring matrix) profile for every entry in PDB. I have a couple of questions about this:

Are there any resources (e.g. databases) for bulk downloading PSSMs?
Is it reasonable (i.e. sensible) to try to calculate a PSSM for every entry in PDB using PSI-BLAST?

I'm new to bioinformatics, so all comments, suggestions, and pointers are welcome!

pssm pdb • 1.3k views

updated 4.3 years ago by Mensur Dlakic ★ 27k • written 4.3 years ago by Simon • 0

score 1 · Answer 1 · 2020-01-12

For the purposes of machine learning, sequence alignments and/or hidden Markov models of protein families should be just as good as PSSMs, and realistically better. You can find a PDB database clustered at 70% identity here and the explanation is here.
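
To make the link between an alignment and a PSSM concrete, here is a minimal Python sketch that turns a toy alignment into log-odds scores per column. It is only an illustration under stated assumptions: the function name `pssm_from_alignment`, the uniform background frequency, the flat pseudocount, and the toy sequences are all made up for this example, and PSI-BLAST itself uses sequence weighting and a more elaborate pseudocount scheme.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_from_alignment(aligned_seqs, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2-odds score.

    Assumptions (illustrative only): uniform background frequencies and a
    flat pseudocount per residue; gaps and unknown characters are ignored.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        # Count only standard residues in this column.
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


# Hypothetical toy alignment: four aligned sequences, one with a gap.
toy_alignment = ["MKVL", "MRVL", "MKIL", "MK-L"]
pssm = pssm_from_alignment(toy_alignment)

for position, column in enumerate(pssm, start=1):
    best = max(column, key=column.get)
    print(position, best, round(column[best], 2))
```

Residues that dominate a column get positive scores and unobserved residues get negative ones, which is the same information an alignment or a profile HMM carries in a less compressed form.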

There is no need to model all sequences, and that goes for all protein databases, not just PDB. As of a month ago, there were almost 0.5 million individual protein chains in PDB. A simple clustering at 95% identity drops that number to ~60 thousand, meaning that more than 85% of protein chains in PDB are 95% identical (or better) to at least one other chain in the database. In other words, there is a huge sequence and structure redundancy in PDB. Depending on your exact task, I think going down to 50% identity clustering would work as well, as almost all sequences that share 50% identity are related. Some people will tell you it is safe to go down to 30-40% identity when clustering.
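
As a rough illustration of the clustering idea, here is a small Python sketch of greedy clustering at an identity threshold, plus the redundancy arithmetic from the numbers above. Everything in it is an assumption made for the example: the chain names and sequences are invented, and `difflib.SequenceMatcher.ratio()` is only a crude stand-in for alignment-based sequence identity. For real data you would use a dedicated tool such as CD-HIT or MMseqs2.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude stand-in for sequence identity (NOT a real alignment score)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold):
    """Greedy incremental clustering: longest sequences are considered first,
    and each sequence joins the first representative it matches at or above
    the threshold; otherwise it becomes a new representative."""
    clusters = {}  # representative name -> list of member names
    for name, seq in sorted(seqs.items(), key=lambda item: -len(item[1])):
        for rep in clusters:
            if similarity(seq, seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Hypothetical toy chains: B is identical to A, C differs by one residue,
# D is unrelated.
toy_chains = {
    "chainA": "MKVLAAGIVGLLLAQ",
    "chainB": "MKVLAAGIVGLLLAQ",
    "chainC": "MKVLAAGIVGALLAQ",
    "chainD": "WPHDWPHDWPHDWPH",
}

clusters = greedy_cluster(toy_chains, threshold=0.9)
print(clusters)

# Redundancy implied by the figures quoted above:
# ~0.5 million chains collapsing to ~60 thousand clusters at 95% identity.
redundant_fraction = 1 - 60_000 / 500_000
print(f"{redundant_fraction:.0%} of chains are redundant at 95% identity")
```

Only the cluster representatives would then need a profile, which is why clustering first cuts the amount of PSI-BLAST or HMM building so sharply.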