9.3 years ago, Andrea_Bio (2.6k) wrote:

Hello

I am returning to bioinformatics after a long absence and would like some advice on what sort of statistics/probability I need to learn to help me with my research.

Please bear in mind that I haven't really 'studied' statistics or maths since school, so I will have to go right back to the basics. I'm pretty sure I can handle some degree of difficulty, though: I once taught myself how to do delta/epsilon proofs of limits in calculus 'for fun', so I can't be that bad.

So assume I am someone with no knowledge but capable of learning.

The other problem is that my research topic is hazy. I'm going to have lots of SNP data available for multiple individuals from different populations, and I'd like to be able to do something meaningful with it. However, it's hard for me to think of potential directions when I don't have the maths skills to know what I could do with my data.

Perhaps I just need to read a basic statistics book and then build from there, but some pointers to interesting directions in statistics and probability would be extremely useful.

Many thanks

statistics • 1.8k views

this does not seem like a question that could be answered

what's the question?

9.3 years ago, Neilfws (48k; Sydney, Australia) wrote:

I think the first issue to address is to make your research topic less hazy. Rather than asking "what tools can I apply to my data?", you should ask "what do I want to get out of my data?" Once you know that, the appropriate tools will suggest themselves and you can begin to learn the specific statistical methods that you need.

Let me illustrate with an example. When I first started out in bioinformatics, I worked on microbial genomics. We generated a lot of different kinds of data (genome sequence, predicted protein and RNA sequences, homology models of proteins), but we did not know what we wanted to do with it. The key thing about our organisms was that they grew in cold environments. So the question we needed to ask was: are there any features in these data that might explain cold adaptation, or that might differ from organisms that grow in warmer environments?

Once we had a question, we were able to search the literature more effectively. We discovered that other people had used a statistical technique, principal components analysis (PCA), to analyse protein amino acid composition, but this had never been applied to proteins from cold-adapted organisms. At the time I knew very little about statistics, PCA or how to use R, but like you I was a "capable learner". So I sat down, taught myself the basics of R usage and was soon able to do PCA on our data.
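To give a feel for what "PCA on amino acid composition" involves, here is a minimal sketch. It is not the analysis described above: that was done in R, whereas this uses only the Python standard library so it runs anywhere. The four sequences and their "cold"/"warm" labels are made up for illustration, and the first principal component is found by power iteration on the covariance matrix, which is a simple stand-in for what a statistics package would do.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical toy sequences, invented for this example (not real proteins).
sequences = {
    "cold_1": "GSGSGSNQGSTGSAGSNGSQ",
    "cold_2": "SGSGNGSQGSGTSGAGSNSG",
    "warm_1": "EKLEKIRELKEVLKERIEKL",
    "warm_2": "KELKEIERLKKELEEVRKLA",
}

def composition(seq):
    """Fraction of each of the 20 standard amino acids in a sequence."""
    counts = [seq.upper().count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    return [c / total for c in counts]

def center(rows):
    """Subtract each column's mean so every variable has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix (rows are observations, columns variables)."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]

def first_component(cov, iterations=200):
    """Leading eigenvector of the covariance matrix by power iteration."""
    # Start from a non-uniform vector: compositions sum to 1, so a uniform
    # vector is mapped to zero by the covariance matrix.
    v = [float(i + 1) for i in range(len(cov))]
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

compositions = [composition(s) for s in sequences.values()]
centered = center(compositions)
pc1 = first_component(covariance(centered))

# Score of each sequence on the first principal component.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]

for name, score in zip(sequences, scores):
    print(f"{name}: PC1 score = {score:+.3f}")
```

With toy data like this, the two groups should land on opposite sides of zero on the first component. Which side is which is arbitrary, because the sign of a principal component is not fixed. On real data you would normally use an established implementation (for example `prcomp` in R) and look at more than one component.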

Once I got over that initial hurdle, I found that I was able to "think statistically" about data: what kinds of analyses might be appropriate, whether the required data were available and how to implement what I wanted using R. R programming skills, by the way, are acquired simply by practice. Every day, open an R console, read in a file, explore the documentation and do some simple analyses and plotting. Soon it becomes second nature.

In summary: good data first, then good questions, then good statistics.

9.3 years ago, Blunders (1.1k) wrote:

It's a little dated, but it's possible that my posting an answer will lead to others posting as well. Also, this page appears to list a number of good documents on the subject:

Computational And Statistical Methods In Bioinformatics (not a PDF, HTML)