Question: Statistical Analysis Of Protein Sequence Properties
2
9.9 years ago by
User 0063220
User 0063220 wrote:

Hi all,

I have 80 homologous sequences. All the sequences have a structural domain. This domain in same cases have a different length. I'd like to perform a statistical analysis to find correlations between domain length, sequence length, and the percentages of polar, basic, and hydrophobic amino acids.

Which statistical test could I use? Could you give me any suggestions about how to perform this analysis?

modified 9.9 years ago by Alastair Kerr5.3k • written 9.9 years ago by User 0063220
1

"This domain in same cases have a different length" makes no sense. Do you mean "some cases"?

Please change the title of this question. The title of a topic should let other people understand what you are asking without having to open and read the whole question. I can change the title for you, but I would prefer that you do it yourself.

5
9.9 years ago by
David Quigley11k
San Francisco
David Quigley11k wrote:

Simple things to get you started:

1. Start by plotting the data to look for outliers and trends. Most of us would recommend the R statistical package. To get started with R (and statistics in general), I suggest Introductory Statistics with R by Peter Dalgaard. See also this thread.
2. As for specific tests, consider using Spearman rank correlation to test for relationships between variables. In R, look at the `cor.test()` function. Alternatively, consider an ANOVA on a linear model to test for relationships between variables. E.g. in R:

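```r
# Scatter plot to look for outliers and trends
plot(seqlen, pa.percent)
# Spearman rank correlation between the two variables
cor.test(seqlen, pa.percent, method = "spearman")
# F-test on a linear model of seqlen against pa.percent
anova(lm(seqlen ~ pa.percent))
```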

Dalgaard's book will help you interpret the results of these tests, though a proper grounding in the fundamentals of statistics is more important than the particular tool you use.
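The pairwise test above can be repeated over every pair of variables. Here is a minimal sketch of that, assuming a data frame with one row per sequence; the column names and the simulated values are made up for illustration, and the Holm correction via `p.adjust()` is my own addition, since running ten tests at once inflates the chance of a spurious hit.

```r
set.seed(7)  # simulated data only, so the sketch runs on its own
n <- 80
dat <- data.frame(
  domain.len          = rnorm(n, 120, 15),
  seqlen              = rnorm(n, 400, 60),
  polar.percent       = rnorm(n, 30, 4),
  basic.percent       = rnorm(n, 14, 3),
  hydrophobic.percent = rnorm(n, 45, 5)
)

# All unordered pairs of column names, one pair per matrix column
pairs_idx <- combn(names(dat), 2)

res <- data.frame(
  var1 = pairs_idx[1, ],
  var2 = pairs_idx[2, ],
  rho  = NA_real_,
  p    = NA_real_
)

# Spearman test for each pair of variables
for (i in seq_len(ncol(pairs_idx))) {
  ct <- cor.test(dat[[pairs_idx[1, i]]], dat[[pairs_idx[2, i]]],
                 method = "spearman")
  res$rho[i] <- unname(ct$estimate)
  res$p[i]   <- ct$p.value
}

# Holm adjustment for multiple testing (an assumption, not from the answer above)
res$p.adj <- p.adjust(res$p, method = "holm")
res
```

With real data containing tied values, `cor.test()` will warn that it cannot compute exact p-values; passing `exact = FALSE` is one way to handle that.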

4
9.9 years ago by
Neilfws49k
Sydney, Australia
Neilfws49k wrote:

As Alastair says, this is a multivariate problem (80 observations x at least 5 variables) and you need to do some exploratory data analysis first.

I'll assume that you are able to calculate the parameters that you described and output a simple data file in e.g. CSV format.
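In case that step is the sticking point, here is a rough base-R sketch of it. The residue groupings, the toy sequences, and the output file name are all assumptions for illustration; classification schemes differ, so substitute whichever one you intend to use.

```r
# Hypothetical residue groupings; adjust them to the classification you prefer.
polar       <- c("S", "T", "N", "Q", "Y", "C")
basic       <- c("K", "R", "H")
hydrophobic <- c("A", "V", "L", "I", "M", "F", "W", "P", "G")

# Percentage of residues in `seq` (a single string) that belong to `group`
percent_in_group <- function(seq, group) {
  residues <- strsplit(toupper(seq), "")[[1]]
  100 * sum(residues %in% group) / length(residues)
}

# Toy input: a named character vector of made-up sequences
seqs <- c(seqA = "MKRSTLLV", seqB = "AAKKHHSTNQ")

props <- data.frame(
  id                  = names(seqs),
  seqlen              = nchar(seqs),
  polar.percent       = sapply(seqs, percent_in_group, group = polar),
  basic.percent       = sapply(seqs, percent_in_group, group = basic),
  hydrophobic.percent = sapply(seqs, percent_in_group, group = hydrophobic),
  row.names = NULL
)

# One row per sequence; domain length would be added as a further column.
out_file <- file.path(tempdir(), "protein_properties.csv")
write.csv(props, out_file, row.names = FALSE)
```

For real work you would read the sequences from a FASTA file rather than typing them in, for example with the seqinR package.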

Principal components analysis is a good starting point; it will show which variables contribute most to the observed variance. Using R, you'd simply read the CSV file into a data frame and use either prcomp() or princomp(). You could then draw, for example, a biplot and see how the observations cluster. In R, I'd also recommend the seqinR package, which contains many functions for sequence analysis.
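A minimal sketch of the PCA step, using simulated values in place of the real CSV so that it runs on its own. The column names are hypothetical, and scaling the variables is my assumption, made because residue counts and percentages sit on very different scales.

```r
set.seed(1)  # simulated data only; replace with your own measurements
n <- 80
dat <- data.frame(
  domain.len          = round(rnorm(n, 120, 15)),
  seqlen              = round(rnorm(n, 400, 60)),
  polar.percent       = rnorm(n, 30, 4),
  basic.percent       = rnorm(n, 14, 3),
  hydrophobic.percent = rnorm(n, 45, 5)
)
# With real data you would use read.csv() here and keep only the numeric columns.

# Centre and standardise each variable before extracting components
pca <- prcomp(dat, center = TRUE, scale. = TRUE)

summary(pca)   # proportion of variance explained by each component
pca$rotation   # loadings: how each variable contributes to each component

# Observations and variable arrows on the first two components
if (interactive()) biplot(pca)
```

The biplot is only drawn in an interactive session here; drop the `if (interactive())` guard if you want the plot written to a file when running as a script.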

From there, you'll need to develop some hypotheses that you can test. Would you expect certain factors to correlate, given what you know about protein properties, and why?

If any of this is not familiar to you - and I suspect by the question it is not - you must seek advice from a statistician and/or teach yourself some basic statistics. Blindly applying methods that you don't understand is not the way to go.

3
9.9 years ago by
Alastair Kerr5.3k
Manchester/UK/Cancer Biomarker Centre at CRUK-MI
Alastair Kerr5.3k wrote:

Given that you have multiple values per gene, Principal Component Analysis (PCA) or correspondence analysis would be a good bet. Comparing domain length to each of the other values one at a time runs the risk of finding a secondary correlation that results from a primary trend among your other variables rather than from domain length itself.
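To illustrate the risk with simulated numbers (not real protein data): below, domain length and hydrophobic percentage both depend on sequence length but not on each other. One simple way to check for this, offered here as a sketch rather than as the method recommended above, is a partial rank correlation: rank the variables, regress out the suspected driver, and correlate the residuals.

```r
set.seed(42)  # simulated data, purely illustrative
n <- 80
seqlen <- rnorm(n, 400, 60)

# Both variables depend on seqlen but not directly on each other.
domain.len          <- 0.3 * seqlen + rnorm(n, 0, 8)
hydrophobic.percent <- 20 + 0.05 * seqlen + rnorm(n, 0, 1.5)

# Raw rank correlation, which picks up the shared dependence on seqlen
raw <- cor(domain.len, hydrophobic.percent, method = "spearman")

# Partial rank correlation: rank everything, regress out seqlen,
# then correlate what is left over.
r_dom <- resid(lm(rank(domain.len) ~ rank(seqlen)))
r_hyd <- resid(lm(rank(hydrophobic.percent) ~ rank(seqlen)))
partial <- cor(r_dom, r_hyd)

c(raw = raw, partial = partial)
```

If the raw value is large and the partial value is much smaller, the apparent relationship is mostly explained by the third variable.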