PSSM feature calculation in Python or R: any clue?
13 months ago

Hi, I need to calculate PSSM profiles for a number of proteins. I've used Biopython, but it gives me a single matrix of size (sequence length × 20) for a whole multiple sequence alignment. What I actually need is a matrix in which each row represents one protein sequence in the alignment, with a fixed length, so that I can use it in machine learning.

I also tried the R package "protr", but it gives me a variable-length matrix.

Some references say that it can be calculated using PSI-BLAST.

I have spent a lot of time trying to figure out how to calculate it, but I couldn't find any reference.

Could anyone please help me solve this problem?

PSSM python R protr • 1.1k views
13 months ago
Mensur Dlakic ★ 14k

PSSMs are not meant for what you want. They are, by definition, position-specific scoring matrices (hence PSSM), so they give a vector of 20 values for each residue. If you want a single row per sequence, I suggest you Google "protein sequence descriptors", as those will give you a fixed-length vector per sequence.

You don't have to take my word for it, as this is something you can investigate yourself, but I will still tell you that in most machine learning applications PSSMs are superior to single-vector descriptors.
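To make the "20 values per residue" shape concrete, here is a minimal sketch that reads a PSI-BLAST ASCII PSSM (the kind of file written by the `-out_ascii_pssm` option of `psiblast`) into a NumPy array with one row per residue. The function name `parse_ascii_pssm` is hypothetical, and the sketch assumes the usual layout in which each data row starts with the position and the residue letter, followed by 20 whitespace-separated integer scores in the column order `ARNDCQEGHILKMFPSTWYV`. It is an illustration under those assumptions, not a robust parser.

```python
import numpy as np

# Column order assumed for PSI-BLAST ASCII PSSM files.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def parse_ascii_pssm(lines):
    """Return (sequence, L x 20 integer score matrix) from ASCII PSSM lines.

    `lines` is any iterable of strings, e.g. an open file handle.
    """
    residues, rows = [], []
    for line in lines:
        parts = line.split()
        # Data rows: position, one-letter residue, 20 log-odds scores,
        # then further columns (percentages etc.) that are ignored here.
        is_data_row = (
            len(parts) >= 22
            and parts[0].isdigit()
            and len(parts[1]) == 1
            and parts[1].isalpha()
        )
        if is_data_row:
            residues.append(parts[1])
            rows.append([int(x) for x in parts[2:22]])
    matrix = np.array(rows, dtype=int).reshape(-1, len(AMINO_ACIDS))
    return "".join(residues), matrix
```

For a protein of length L this returns an L × 20 matrix, which is why the number of rows changes from one protein to the next.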


Hello Mensur, thanks for your answer. I see that many publications use PSSMs in machine learning, but I don't know how they feed this matrix into the model. Do they flatten the matrix? If so, how? A PSSM is not of fixed size, so even when flattened it still can't be used directly. That's why I'm so confused.


Most machine learning applications for proteins are per-residue predictors, so PSSMs work perfectly by providing a uniform vector of 20 numbers for each residue. In most cases the prediction is whether the residue is buried or exposed; phosphorylated or not; disordered or not; in a helix, strand or coil; and so on. In many instances a window of PSSM rows is taken around the residue of interest to account for neighborhood effects. A common window size is 15 (window sizes must be odd, with the residue to be predicted always in the middle), which gives a vector of 15 × 20 = 300 numbers. For N-terminal residues the left-hand side is padded with zeros to bring the vector up to 300 numbers, and for C-terminal residues the right-hand side is padded. I suggest you read up on these ML applications and how exactly they are executed before you spend too much time reinventing the wheel. A couple of labs that may be of interest:
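The windowing described above can be sketched roughly as follows. This is a minimal illustration, assuming the PSSM is already available as an L × 20 NumPy array; the function name `window_features` and its `window` parameter are hypothetical, not taken from any particular predictor.

```python
import numpy as np

def window_features(pssm, window=15):
    """Return an (L, window * 20) array: one flattened window per residue.

    Rows that would fall outside the sequence are filled with zeros, so
    N-terminal residues are zero-padded on the left and C-terminal
    residues on the right.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so the target residue is in the middle")
    pssm = np.asarray(pssm, dtype=float)
    half = window // 2
    length = pssm.shape[0]
    # Add `half` all-zero rows before the first and after the last residue.
    padded = np.pad(pssm, ((half, half), (0, 0)), mode="constant")
    # Window i covers padded rows i .. i + window - 1, centred on residue i.
    return np.stack([padded[i:i + window].ravel() for i in range(length)])
```

With `window=15`, every residue gets a vector of 300 numbers regardless of protein length, so each residue becomes one training example for a per-residue model.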

If you are developing a predictor that classifies whole sequences rather than residues, that could probably still be done by some clever application of PSSMs. But generally speaking, you probably want to use pseudo amino acid composition (PseAAC), which will give you a fixed-length vector per sequence regardless of its length. Again, I suggest reading what others have done before you, specifically all the early Chou papers.
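As one simple example of collapsing a PSSM into a fixed-length vector per sequence, the rows can be averaged, either over all positions (20 numbers) or separately for each amino acid type in the sequence (20 × 20 = 400 numbers). The sketch below is only an assumption about what such an encoding might look like, with hypothetical function names; it is not PseAAC and not necessarily what any given paper used, so check the methods section of the papers you are comparing against.

```python
import numpy as np

ALPHABET = "ARNDCQEGHILKMFPSTWYV"  # assumed PSSM column order

def pssm_column_means(pssm):
    """Collapse an (L, 20) PSSM to 20 numbers by averaging over positions."""
    return np.asarray(pssm, dtype=float).mean(axis=0)

def pssm_means_by_residue_type(pssm, sequence, alphabet=ALPHABET):
    """Collapse an (L, 20) PSSM to 400 numbers.

    For each amino acid type, average the PSSM rows at the positions where
    that type occurs in `sequence`; types that are absent contribute zeros.
    """
    pssm = np.asarray(pssm, dtype=float)
    seq = np.array(list(sequence))
    if len(seq) != pssm.shape[0]:
        raise ValueError("sequence length must match the number of PSSM rows")
    out = np.zeros((len(alphabet), pssm.shape[1]))
    for k, aa in enumerate(alphabet):
        mask = seq == aa
        if mask.any():
            out[k] = pssm[mask].mean(axis=0)
    return out.ravel()
```

Either vector has the same length for every protein, so the rows can be stacked into a samples × features matrix for a sequence-level classifier.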


Thanks for your useful and detailed explanation.

Actually, I am comparing a new descriptor against PSSM to validate its performance. I am working on a classification problem: I want to classify proteins into categories based on their sequence. The new descriptor achieved high accuracy on the dataset compared with previous papers that used PSSM.

So I need to include PSSM in my model to compare the results of both.