Question

Principal component analyses

8.6 years ago

vicky ▴ 30

Hi All

Could someone who has done principal component analysis (PCA) tell me the minimum number of genomes needed for this analysis? I want to run it on some mammalian genomes and was wondering what the minimum number is. Someone told me it's around four individual genomes.

Can someone help me clarify this? I want to do the PCA on whole genome sequences of mammals.

Regards

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.6 years ago by vicky ▴ 30

What are you trying to do? In general, you want many more samples than variables (I guess this is rarely the case with genomes), but there are no hard and fast rules. With too few samples, your PCA results can be unstable. You can get an idea of how stable the results are with a cross-validation-style approach: take one sample away and redo the PCA. If the PCA is stable, all results should be similar (but remember that PCA solutions are only determined up to sign, and components with nearly equal variance can rotate among themselves, so you would need to rotationally align the solutions to maximize similarity before comparing them).
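A minimal sketch of that leave-one-out check, assuming NumPy and simulated data (the helper names `pca_axes` and `loo_stability` and the toy matrices are made up for illustration). Instead of explicitly rotating each solution, it compares the subspaces spanned by the top PCs via principal angles, which is unaffected by sign flips or rotations within the subspace:

```python
import numpy as np

def pca_axes(X, k):
    """Top-k principal axes (as rows) of a samples-by-variables matrix X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]

def loo_stability(X, k=2):
    """Drop each sample in turn and compare the top-k subspace to the full-data one.

    Returns, per left-out sample, the smallest cosine of the principal angles
    between the two subspaces (values near 1.0 mean nearly identical subspaces).
    """
    full = pca_axes(X, k)
    scores = []
    for i in range(X.shape[0]):
        sub = pca_axes(np.delete(X, i, axis=0), k)
        # Singular values of full @ sub.T are the cosines of the principal angles.
        cosines = np.linalg.svd(full @ sub.T, compute_uv=False)
        scores.append(cosines.min())
    return np.array(scores)

rng = np.random.default_rng(0)
# Toy data (an assumption): 30 samples, 200 variables, two strong latent axes plus noise.
latent = rng.normal(size=(30, 2)) * [10.0, 5.0]
loadings = rng.normal(size=(2, 200))
X_many = latent @ loadings + rng.normal(size=(30, 200))
X_few = X_many[:4]  # pretend only four samples were available

stab_many = loo_stability(X_many, k=2)
stab_few = loo_stability(X_few, k=2)
print("worst-case subspace cosine, 30 samples:", stab_many.min())
print("worst-case subspace cosine,  4 samples:", stab_few.min())
```

On real data you would pass your own samples-by-variants matrix; how low a cosine counts as "unstable" is a judgment call, not a fixed threshold.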

ADD REPLY • link 8.6 years ago by Jean-Karim Heriche 27k

Ram · Answer 1 · 2015-09-09

Principal component analysis will identify the largest uncorrelated axes of variation in a data set. Thus, in this post, I will try to add to the previous comment by explaining using examples why the scope of the PCA will be limited to the scope of the dataset. Let us start with something familiar - clothing. This webpage describes a PCA done on dresses, then does a few interesting things with this PCA afterwards.

If you read through this webpage, you will note the sentence "And [the PCA] can't recreate accessories that were not present in the training set (notice the sunglasses and handbag disappear)". In other words, since there were no sunglasses or handbags in the original set, the principal component reconstruction (adding together the first few PCs, in this case 70, for a newly processed image) cannot recreate them.
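The same effect can be reproduced with a toy example. This is only a sketch with made-up numbers (NumPy assumed; the 50 features stand in for pixels, and feature 40 plays the role of the sunglasses), not the method used on that webpage:

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_feat, k = 200, 50, 5

# Training data varies only in the first 10 features; features 10..49 are
# always zero, i.e. the "accessory" never appears in the training set.
train = np.zeros((n_train, n_feat))
train[:, :10] = rng.normal(size=(n_train, 10))

mean = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
components = Vt[:k]  # top-k principal axes learned from the training set

# A new sample with a large value in feature 40, unseen during training.
new = np.zeros(n_feat)
new[:10] = rng.normal(size=10)
new[40] = 5.0

# Project onto the PCs, then rebuild the sample from those k scores.
scores = (new - mean) @ components.T
recon = mean + scores @ components

print("feature 40 in the new sample:      ", new[40])
print("feature 40 after PC reconstruction:", float(recon[40]))
```

Because no principal axis has any weight on feature 40, the reconstruction cannot put anything there, no matter how many PCs are kept.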

For our second example, let us consider something a bit more related to biology and bioinformatics... OK, now imagine you do whole genome sequencing of 100 people, all either African or Asian, and that you get a reference panel of 1000 African and Asian whole genomes to help with the PCA. But, now imagine actually 5 of your people are partially Caucasian.

This axis of variation will be minor, since it affects only 5 of 1100 people, and it probably won't show up among the first few principal components.
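To put rough numbers on this: a 0/1 group indicator with frequency p and effect size d adds about p(1 - p)d² of variance, so a split shared by half the samples contributes far more than one carried by 5 of 1100. The simulation below is a hedged illustration with invented parameters (100 features, unit noise, the same shift of 4.0 for both axes), not real genotype data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_feat, shift = 1100, 100, 4.0

population = np.repeat([0, 1], n // 2)              # two well-represented groups
admixed = np.zeros(n, dtype=int)
admixed[rng.choice(n, size=5, replace=False)] = 1   # 5 individuals with extra ancestry

X = rng.normal(size=(n, n_feat))
X[:, 0] += shift * population   # major axis of variation
X[:, 1] += shift * admixed      # minor axis, carried by only 5 samples

# Variance each indicator contributes: p * (1 - p) * shift**2
for name, ind in [("population", population), ("admixed", admixed)]:
    p = ind.mean()
    print(name, "adds variance of about", p * (1 - p) * shift**2)

Xc = X - X.mean(axis=0)
U, S, _ = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :3] * S[:3]       # scores on the first three PCs

def abs_corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

corr_pop = [abs_corr(scores[:, j], population) for j in range(3)]
corr_adm = [abs_corr(scores[:, j], admixed) for j in range(3)]
print("|corr| of PC1-3 with population label:", np.round(corr_pop, 2))
print("|corr| of PC1-3 with admixed label:   ", np.round(corr_adm, 2))
```

In this setup PC1 tracks the population split, while the axis carried by five samples adds too little variance to rise above the noise in the leading PCs.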

Since you did not include any Caucasians in your reference data, you have no context to interpret what that variation is or what it might represent. If you were then to do something like use the top few PCs to control for variation, which is commonly done in genetic studies, you would likely not have included this variation, and thus your study might be susceptible to confounds...

Now, you are suggesting to run a PCA on 4 animals, presumably without any reference mammals of the same type. So, the principal components you get back may accurately describe the variation in these 4 animals, but depending on circumstances and what you are studying, it might be impossible to generalize outside of that...
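One hard limit worth keeping in mind here: after mean-centering, n samples can yield at most n - 1 principal components with non-zero variance, so 4 genomes give at most 3 PCs however many variants you have. A quick check with NumPy on a made-up genotype matrix (random 0/1/2 allele counts, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_variants = 4, 10_000

# Hypothetical genotype matrix: alternate-allele counts (0, 1 or 2) per variant.
G = rng.integers(0, 3, size=(n_samples, n_variants)).astype(float)

Gc = G - G.mean(axis=0)                  # centre each variant
S = np.linalg.svd(Gc, compute_uv=False)  # singular values, largest first
explained = S**2 / np.sum(S**2)          # fraction of variance per PC

# Count components whose variance is not numerically zero.
n_informative = int(np.sum(S > 1e-8 * S[0]))
print("samples:", n_samples, "-> PCs with non-zero variance:", n_informative)
print("variance explained:", np.round(explained, 3))
```

So the computation runs with 4 samples; the question is whether those three axes mean anything beyond the four animals themselves.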

So, as Jean-Karim suggested, it might help us to know what you are trying to study, and it might help you to find any relevant data from the same organism, etc., to help place the PCs you get within a larger and more generalizable body of variance belonging to the population you are trying to study.

Finally, if you do not include any samples outside of your own subjects, then your PCA will probably not be much more than a descriptive technique for your own data. This could still be very useful. For example, if you suspect that 12 of your 80 samples might have been contaminated, you might be able to distinguish them from the other samples by doing a PCA and noting whether a given PC correlates with the suspected contamination status.
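As a rough sketch of that kind of check, assuming NumPy and simulated data (the 500 features, the size of the contamination effect, and which samples are flagged are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_feat = 80, 500

# Suspected contamination status: 12 of the 80 samples are flagged.
contaminated = np.zeros(n, dtype=int)
contaminated[:12] = 1

# Assumption: contamination shifts the first 100 features by 2 units.
effect = np.zeros(n_feat)
effect[:100] = 2.0
X = rng.normal(size=(n, n_feat)) + np.outer(contaminated, effect)

Xc = X - X.mean(axis=0)
U, S, _ = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :5] * S[:5]   # scores on the first five PCs

# Correlate each PC's scores with the suspected status.
corrs = [abs(np.corrcoef(scores[:, j], contaminated)[0, 1]) for j in range(5)]
best = int(np.argmax(corrs))
print("|corr| with contamination status, PC1-5:", np.round(corrs, 2))
print("most associated component: PC%d" % (best + 1))
```

With real data the association may be weaker or spread over several PCs, so a high correlation is a hint to investigate rather than proof of contamination.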

Does any of that help?