Principal Component Analysis On A Multiple Alignment
13.2 years ago
Jamand ▴ 110

Dear All

I have performed a principal component analysis on an MSA. Each residue has been replaced by a vector of 5 values representing the properties of that amino acid, according to the Atchley matrix. Columns with gaps have been removed. The goal is to identify specificity-determining positions (SDPs). I used the NIPALS algorithm with cross-validation to select the PCs, and I got a model with 11 PCs. I am having some difficulty interpreting my analysis. How can I identify the important positions in the alignment from the results of my PCA?

statistics alignment multiple • 5.3k views

Jalview has a PCA option for MSAs; see if it helps.

13.2 years ago
Casbon ★ 3.3k

Interesting idea!

Your original basis has n*5 dimensions, where n is the number of positions in the alignment, right?

If you look at the principal components in terms of the original basis, look for the large values. If a value is large (in absolute value), the component varies strongly along that axis of the original space, so work out which residue it belongs to. For example, let's say Aij is the value for the ith position and the jth Atchley factor. If I get a principal component whose largest values are A12 and A34, then I would look at positions 1 and 3 as significant.

Since you have five values per position, you really want to summarise the significance across the position with Pythagoras (since these are essentially components of a vector), i.e. the significance of position 1 is A11^2 + A12^2 + A13^2 + A14^2 + A15^2.
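To make the sum-of-squares idea concrete, here is a minimal Python sketch. It assumes the loading vector of one PC is available as a flat list ordered position by position (A11, A12, ..., A15, A21, ...); the function names and the toy numbers are made up for illustration and are not taken from any real analysis.

```python
FACTORS_PER_POSITION = 5  # five Atchley factors per alignment column


def position_importance(loading, n_factors=FACTORS_PER_POSITION):
    """Collapse one PC's loading vector into one score per alignment position.

    `loading` is assumed to be ordered position by position:
    [A11, A12, ..., A15, A21, ...]. The score is the sum of squares
    of the five loadings that belong to the same position.
    """
    if len(loading) % n_factors != 0:
        raise ValueError("loading length must be a multiple of n_factors")
    scores = []
    for start in range(0, len(loading), n_factors):
        block = loading[start:start + n_factors]
        scores.append(sum(v * v for v in block))
    return scores


def rank_positions(loading, n_factors=FACTORS_PER_POSITION):
    """Return 1-based position indices, highest score first."""
    scores = position_importance(loading, n_factors)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i + 1 for i in order]


# Toy loading vector for three positions (made-up numbers).
toy_loading = [0.1, 0.7, 0.0, 0.1, 0.0,   # position 1
               0.0, 0.1, 0.0, 0.0, 0.1,   # position 2
               0.2, 0.0, 0.1, 0.6, 0.0]   # position 3

print(rank_positions(toy_loading))
```

With 11 PCs you would repeat this per component; how to combine the per-component scores (for example, weighting by explained variance) is a separate choice that this sketch does not make.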

However, I don't know how normal the Atchley values are; you may have to normalise them to get meaningful data.
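One common way to normalise is to z-score each variable (column) before the PCA. Whether this is appropriate for Atchley-encoded data is a judgment call, so take the following as a sketch only. The matrix layout (a list of rows, one row per sequence) and the function name are assumptions.

```python
import statistics


def standardize_columns(matrix):
    """Z-score each column of a row-major matrix (list of rows).

    Each column is shifted to mean 0 and scaled to population standard
    deviation 1. Constant columns (e.g. fully conserved positions) carry
    no variance and are set to 0.0 instead of dividing by zero.
    """
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    scaled_cols = []
    for col in columns:
        mean = statistics.fmean(col)
        sd = statistics.pstdev(col)
        if sd == 0:
            scaled_cols.append([0.0] * len(col))
        else:
            scaled_cols.append([(v - mean) / sd for v in col])
    # Transpose back to a list of rows.
    return [list(row) for row in zip(*scaled_cols)]


# Toy matrix: two sequences, three variables (made-up numbers).
toy = [[1.0, 10.0, 5.0],
       [3.0, 30.0, 5.0]]
print(standardize_columns(toy))
```

Note that scaling gives every variable the same weight, so a nearly conserved column with one odd residue is inflated to the same variance as a genuinely variable column.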


I am used to seeing the Euclidean norm used as the significance: sqrt(A11^2 + A12^2 + A13^2 + A14^2 + A15^2). Is it so different? I think not: the square root is monotonic, so the ranking of positions is the same.

13.2 years ago
User 0063 ▴ 240

Hi,

Here is the Atchley matrix (amino acid followed by Factors 1 to 5, semicolon-separated, with decimal commas):

A;-0,591;-1,302;-0,733;1,57;-0,146
C;-1,343;0,465;-0,862;-1,020;-0,255
D;1,05;0,302;-3,656;-0,259;-3,242
E;1,357;-1,453;1,477;0,113;-0,837
F;-1,006;-0,590;1,891;-0,397;0,412
G;-0,384;1,652;1,33;1,045;2,064
H;0,336;-0,417;-1,673;-1,474;-0,078
I;-1,239;-0,547;2,131;0,393;0,816
K;1,831;-0,561;0,533;-0,277;1,648
L;-1,019;-0,987;-1,505;1,266;-0,912
M;-0,663;-1,524;2,219;-1,005;1,212
N;0,945;0,828;1,299;-0,169;0,933
P;0,189;2,081;-1,628;0,421;-1,392
Q;0,931;-0,179;-3,005;-0,503;-1,853
R;1,538;-0,055;1,502;0,44;2,897
S;-0,228;1,399;-4,760;0,67;-2,647
T;-0,032;0,326;2,213;0,908;1,313
V;-1,337;-0,279;-0,544;1,242;-1,262
W;-0,595;0,009;0,672;-2,128;-0,184
Y;0,26;0,83;3,097;-0,838;1,512


Factor 1 is termed the polarity index. It correlates well with a large variety of descriptors including the number of hydrogen bond donors, polarity versus nonpolarity, and hydrophobicity versus hydrophilicity.

Factor 2 is a secondary structure index. It represents the propensity of an amino acid to occur in a particular type of secondary structure, such as a coil, turn or bend, versus its frequency in an α-helix.

Factor 3 is correlated with molecular size, volume and molecular weight.

Factor 4 reflects the number of codons coding for an amino acid and amino acid composition. These attributes are related to various physical properties including refractivity and heat capacity.

Factor 5 is related to the electrostatic charge.

I wrote some code to substitute the amino acids in my MSA with numeric values. So I get about 1860 variables named F1_1, F1_2, ..., F1_5, F2_1, F2_2, ..., F2_5, and so on.
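For readers who want to reproduce this encoding step, here is a minimal Python sketch. It is not my original code. It parses the table in the format posted above, drops every alignment column that contains a symbol without a factor vector (gaps, and as an assumption also ambiguous symbols such as X), and names the variables F<column>_<factor>. Using the original alignment column number in the name is an assumption, and the three toy sequences are made up.

```python
ATCHLEY_TEXT = """\
A;-0,591;-1,302;-0,733;1,57;-0,146
C;-1,343;0,465;-0,862;-1,020;-0,255
D;1,05;0,302;-3,656;-0,259;-3,242
E;1,357;-1,453;1,477;0,113;-0,837
F;-1,006;-0,590;1,891;-0,397;0,412
G;-0,384;1,652;1,33;1,045;2,064
H;0,336;-0,417;-1,673;-1,474;-0,078
I;-1,239;-0,547;2,131;0,393;0,816
K;1,831;-0,561;0,533;-0,277;1,648
L;-1,019;-0,987;-1,505;1,266;-0,912
M;-0,663;-1,524;2,219;-1,005;1,212
N;0,945;0,828;1,299;-0,169;0,933
P;0,189;2,081;-1,628;0,421;-1,392
Q;0,931;-0,179;-3,005;-0,503;-1,853
R;1,538;-0,055;1,502;0,44;2,897
S;-0,228;1,399;-4,760;0,67;-2,647
T;-0,032;0,326;2,213;0,908;1,313
V;-1,337;-0,279;-0,544;1,242;-1,262
W;-0,595;0,009;0,672;-2,128;-0,184
Y;0,26;0,83;3,097;-0,838;1,512
"""


def parse_atchley(text):
    """Parse 'AA;f1;f2;f3;f4;f5' lines with decimal commas into a dict."""
    table = {}
    for line in text.strip().splitlines():
        aa, *values = line.split(";")
        table[aa] = [float(v.replace(",", ".")) for v in values]
    return table


def encode_alignment(sequences, table):
    """Turn aligned sequences into numeric rows of 5 values per kept column.

    `sequences` maps sequence name -> aligned string (all the same length).
    Returns (variable_names, rows), where rows maps name -> list of floats.
    """
    names = list(sequences)
    length = len(sequences[names[0]])
    if any(len(seq) != length for seq in sequences.values()):
        raise ValueError("all aligned sequences must have the same length")
    # Keep a column only if every sequence has a symbol present in the table.
    keep = [j for j in range(length)
            if all(sequences[n][j] in table for n in names)]
    var_names = [f"F{j + 1}_{k + 1}" for j in keep for k in range(5)]
    rows = {n: [x for j in keep for x in table[sequences[n][j]]]
            for n in names}
    return var_names, rows


table = parse_atchley(ATCHLEY_TEXT)
toy_msa = {"seq1": "AC-D", "seq2": "GC-E", "seq3": "ACKD"}  # made-up alignment
var_names, rows = encode_alignment(toy_msa, table)
print(len(var_names), var_names[:5])
```

In the toy alignment the third column contains gaps, so only columns 1, 2 and 4 are encoded, giving 15 variables per sequence.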

Because of the large number of variables (5 per column in the MSA), I performed a PCA using NIPALS.

I got 11 PCs.
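For reference, the core NIPALS iteration can be sketched in plain Python as below. This is not the software I used, only an illustration of the algorithm: it assumes a complete, column-centred matrix given as a list of rows, and it does not include the cross-validation step used to choose the number of PCs. The function names, tolerance and toy data are assumptions.

```python
import math


def center_columns(X):
    """Subtract the column mean from every column of a list-of-rows matrix."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in X]


def nipals_pca(X, n_components, tol=1e-10, max_iter=1000):
    """Extract PCs one at a time. Returns (scores, loadings).

    scores[k] has one value per row (sequence) of X;
    loadings[k] has one value per column (variable) of X, with unit length.
    """
    R = [row[:] for row in X]  # residual matrix, deflated after each PC
    n, m = len(R), len(R[0])
    scores, loadings = [], []
    for _component in range(n_components):
        # Start from the column with the largest sum of squares.
        j0 = max(range(m), key=lambda j: sum(r[j] ** 2 for r in R))
        t = [r[j0] for r in R]
        if sum(v * v for v in t) == 0:
            raise ValueError("no variance left to extract")
        for _iteration in range(max_iter):
            tt = sum(v * v for v in t)
            # Loading: regress the columns of R on the current score vector.
            p = [sum(R[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
            norm = math.sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]
            # Score: project the rows of R onto the loading vector.
            t_new = [sum(R[i][j] * p[j] for j in range(m)) for i in range(n)]
            diff = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if diff <= tol * sum(v * v for v in t):
                break
        # Deflate: remove the part of R explained by this component.
        for i in range(n):
            for j in range(m):
                R[i][j] -= t[i] * p[j]
        scores.append(t)
        loadings.append(p)
    return scores, loadings


# Toy data: most of the variance lies along the first variable.
toy = center_columns([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
scores, loadings = nipals_pca(toy, n_components=2)
print(loadings)
```

Each `loadings[k]` is the vector in the original F<column>_<factor> basis that can be inspected for large values, and `scores[k]` gives one coordinate per sequence.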

I had a look at the variable importance and got a table listing all variables (F1_1, F1_2, ...), their power (varying from 0 to 1) and their importance in the analysis, sorted in descending order.

Then I had a look at the loadings matrix. There I found 11 variables (the PCs), with the cases on the rows, taking both negative and positive values.

The score matrix is composed of the 11 variables (PCs) and the sequence names.

How can I identify the residues that determine specificity? I think the residues that discriminate most strongly between groups are probably the ones that determine specificity.

Should I perform a PCA with a categorical variable to group the proteins into subfamilies?

Should I perform a discriminant analysis on the matrix resulting from the PCA?