PSSMs for entire PDB
16 months ago
Simon • 0

Hello!

For a machine learning algorithm, I would like to have a PSSM (position-specific scoring matrix) profile for every entry in PDB. I have a couple of questions about this:

  • Are there any resources (e.g. databases) for bulk downloading PSSMs?
  • Is it reasonable (i.e. sensible) to calculate a PSSM for every entry in PDB using PSI-BLAST?

I'm new to bioinformatics, so all comments, suggestions, and pointers are welcome!

pssm pdb • 599 views
16 months ago
Mensur Dlakic ★ 11k

For the purposes of machine learning, sequence alignments and/or hidden Markov models of protein families should be just as good as, and realistically better than, PSSMs. You can find a PDB database clustered at 70% identity here and the explanation is here.

There is no need to model every sequence, and that holds for all protein databases, not just PDB. As of a month ago, there were almost 0.5 million individual protein chains in PDB. Simple clustering at 95% identity drops that number to ~60 thousand, meaning that more than 85% of protein chains in PDB are at least 95% identical to at least one other chain in the database. In other words, there is huge sequence and structure redundancy in PDB. Depending on your exact task, I think clustering at 50% identity would work as well, since almost all sequences that share 50% identity are related. Some people will tell you it is safe to go down to 30-40% identity when clustering.
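If you still want PSSMs, the practical consequence is to compute them only for one representative per cluster. Here is a rough Python sketch of that idea, not a tested pipeline. The file name `reps/1abc_A.fasta`, the database name `uniref90`, the output directory `pssm`, and the default of 3 iterations are all placeholders I made up for illustration. The function only builds the PSI-BLAST command, so you can inspect it before running anything.

```python
from pathlib import Path


def redundant_fraction(total_chains: int, n_clusters: int) -> float:
    """Fraction of chains that can be skipped if each cluster keeps one representative."""
    return 1.0 - n_clusters / total_chains


def psiblast_command(query_fasta: str, db: str, out_dir: str, iterations: int = 3) -> list[str]:
    """Build (but do not run) a PSI-BLAST call that writes an ASCII PSSM for one query."""
    pssm_path = Path(out_dir) / (Path(query_fasta).stem + ".pssm")
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", db,
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", str(pssm_path),
    ]


# Rounded figures from the post: ~0.5 million chains, ~60 thousand clusters at 95% identity.
print(f"{redundant_fraction(500_000, 60_000):.0%} of chains are redundant")

# Hypothetical representative sequence and database name (assumptions, not real files).
cmd = psiblast_command("reps/1abc_A.fasta", "uniref90", "pssm")
print(" ".join(cmd))
```

To actually run it you would pass `cmd` to `subprocess.run(cmd, check=True)` for each representative, with BLAST+ installed and a formatted protein database available. Looping over ~60 thousand representatives is a much smaller job than looping over every chain.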
