Score Protein Variants Based On Frequency Of Aa In Multiple Sequence Alignment
11.6 years ago
Tim ▴ 340

For reference, please read this excerpt from
Human non-synonymous SNPs: server and survey
Vasily Ramensky, Peer Bork, and Shamil Sunyaev

Profile analysis of homologous sequences. The amino acid replacement may be incompatible with the spectrum of substitutions observed at that position in a family of homologous proteins. PolyPhen identifies homologues of the input sequences via a BLAST (23) search of the NRDB database. The set of aligned sequences with sequence identity to the input sequence in the range 30-94% (inclusive) is used by the new version of the PSIC (position-specific independent counts) software (24) to calculate the so-called profile matrix. Elements of the matrix (profile scores) are logarithmic ratios of the likelihood of a given amino acid occurring at a particular site to the likelihood of this amino acid occurring at any site (background frequency). PolyPhen computes the absolute value of the difference between profile scores of both allelic variants in the polymorphic position. PolyPhen also shows the number of aligned sequences at the query position; this may be used to assess the reliability of profile score calculations.

I'd like to calculate something similar to what's described here programmatically (scoring variants based on the frequency of each amino acid in the aligned sequences), but I can't find any implementation of the system described above.

Does anyone know of a working implementation of this or something similar, that's available either in code or as a web service?

Or is it easy enough to implement something like this ourselves?

protein sequence multiple scoring • 3.7k views
11.5 years ago
Bilouweb ★ 1.1k

I use such profile matrices, but I don't know of any public implementation; I wrote my own in C++. It does not take long to do.

I create an array of arrays, "tab[L][20]", where L is the length of the alignment (one row per column of the alignment, one entry per amino acid).

Then I read the sequences of the alignment and count the occurrences of each amino acid in each column. I also count the number of gaps. From these counts I can calculate a log-odds score, as in Fano [1].
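A minimal Python sketch of the counting and log-odds steps described above. This is not the C++ code mentioned in this answer, and it is not PSIC: the function names, the uniform background frequencies, and the add-one pseudocount are all assumptions made for illustration. The last function takes the absolute difference between the scores of two residues at one position, as in the PolyPhen excerpt quoted in the question.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_counts(alignment):
    """Count each amino acid (and the gaps) in every alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    counts, gaps = [], []
    for i in range(length):
        column = Counter(seq[i].upper() for seq in alignment)
        counts.append([column.get(aa, 0) for aa in AMINO_ACIDS])
        gaps.append(column.get("-", 0))
    return counts, gaps

def log_odds_profile(counts, background=None, pseudocount=1.0):
    """Return tab[i][a] = ln(p(a in column i) / q(a))."""
    if background is None:
        # Assumption: uniform background; replace with real database frequencies.
        background = [1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
    profile = []
    for col in counts:
        # Assumption: add-one pseudocount so unseen residues get a finite score.
        total = sum(col) + pseudocount * len(AMINO_ACIDS)
        profile.append([math.log(((n + pseudocount) / total) / q)
                        for n, q in zip(col, background)])
    return profile

def variant_score(profile, position, ref_aa, alt_aa):
    """Absolute difference between two residues' scores (0-based position)."""
    row = profile[position]
    return abs(row[AMINO_ACIDS.index(ref_aa)] - row[AMINO_ACIDS.index(alt_aa)])

# Toy alignment, invented for this example.
alignment = ["MKV", "MKL", "MRV", "M-V"]
counts, gaps = column_counts(alignment)
profile = log_odds_profile(counts)
score = variant_score(profile, 2, "V", "L")
print(gaps)   # gaps per column
print(score)  # V vs. L at the third column
```

The gap counts are kept separate so that columns with few aligned residues can be flagged as unreliable, in the spirit of the "number of aligned sequences at the query position" mentioned in the excerpt.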

One thing to watch out for is the similarity between the sequences in the alignment. If many sequences are nearly identical, the amino acids they share are over-represented at each position, which biases the statistics.
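One common way to reduce this bias is to down-weight redundant sequences before counting. The sketch below uses position-based sequence weights in the style of Henikoff and Henikoff, which is a swapped-in technique and not what PSIC or my own code does. The function name is hypothetical, and treating a gap as just another symbol is a simplifying assumption.

```python
from collections import Counter

def position_based_weights(alignment):
    """Give sequences carrying rare residues in a column a larger weight."""
    weights = [0.0] * len(alignment)
    for i in range(len(alignment[0])):
        column = [seq[i] for seq in alignment]
        freq = Counter(column)
        k = len(freq)  # number of distinct symbols in this column
        for s, symbol in enumerate(column):
            # Each column distributes a total weight of 1 over the sequences.
            weights[s] += 1.0 / (k * freq[symbol])
    total = sum(weights)
    return [w / total for w in weights]  # normalised to sum to 1

# Toy alignment, invented for this example: three identical sequences
# and one divergent sequence.
alignment = ["MKV", "MKV", "MKV", "MRL"]
weights = position_based_weights(alignment)
print(weights)
```

With these weights, each sequence would add its weight (instead of 1) to the count of its residue in every column, so a block of near-identical sequences no longer dominates the column frequencies.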

If you want to see how I use profiles, you can read: FROST: a filter-based fold recognition method

[1] Fano RM. Transmission of information: a statistical theory of communication. Cambridge, MA: MIT Press; 1961.

11.6 years ago
Chris ★ 1.6k

I'm not sure I understand you correctly. If you are looking for a web service that returns the PSIC scoring matrix, why don't you just follow the URL mentioned in the paper's abstract? It leads to an HTML form where you can paste your multiple alignment, and it returns the PSIC matrix. Or did I misunderstand you?


The form on the above page triggers a Perl script, so that script probably has the code you're looking for. Maybe mail the webmaster or the authors of the article for a copy of that code?


I want to do this programmatically, so that I can run the scoring thousands of times. Doing it manually won't do, and scripting the web form with curl seems hackish, unreliable, and sensitive to changes on the server.

