Forum:How do I explain to my non-bioinformatician labmates how to interpret principal components?
8.4 years ago

I'm training to become a bioinformatician in a lab where everyone else is a traditional wet lab scientist with little math/analysis/programming background. Recently, I did an RNA-Seq experiment on three tissue types - CNV deletion, CNV duplication, and control CNV - and I applied a PCA as an exploratory analysis to see if the CNV has sufficient impact on the transcriptome of the samples that CNV-DEL, CNV-DUP, and CNV-CTRL cluster separately.

When he saw my PCA plot, a labmate asked what a principal component is. Here's how I tried to explain it: A principal component is like a line composed of many variables (in this case, the 171,000 transcripts in my data) that captures an element of variation in the data. Each principal component has a unique pattern of weights assigned to each variable, and these weights are chosen such that the PCs are uncorrelated with each other - that is, the PCs are orthogonal. BUT - the genes with the highest weight in the PC are not inherently biologically meaningful.
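To make the "pattern of weights" idea concrete, here is a minimal sketch using NumPy on a made-up matrix (9 samples by 50 transcripts; the sizes, the random data, and all variable names are assumptions for illustration, not the real dataset). It computes the PCs with a singular value decomposition and checks that the weight vectors are orthogonal and that the resulting sample scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy matrix: 9 samples (rows) x 50 transcripts (columns).
n_samples, n_genes = 9, 50
expr = rng.normal(size=(n_samples, n_genes))

# Center each transcript so PCA describes variation around its mean.
centered = expr - expr.mean(axis=0)

# SVD of the centered matrix: each row of vt is one principal component,
# i.e. one weight (loading) per transcript.
u, s, vt = np.linalg.svd(centered, full_matrices=False)
loadings = vt                # shape: (n_components, n_genes)
scores = centered @ vt.T     # sample coordinates, as drawn in a PCA plot

# Fraction of the total variance captured by each component.
explained = s**2 / np.sum(s**2)

# Weight vectors are orthonormal; score columns are uncorrelated.
gram = loadings[:3] @ loadings[:3].T
score_cov = np.cov(scores[:, :3], rowvar=False)

print(np.round(gram, 6))
print(np.round(score_cov, 6))
print(np.round(explained[:3], 3))
```

The first printed matrix should be the identity (orthogonal weights) and the second should be diagonal (uncorrelated scores). Nothing in the computation knows what the transcripts are, which is why a large weight by itself carries no biological label.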

My labmates insist that the principal components must contain biologically relevant information about relationships between genes. I think their idea is that I could list which genes have the highest weights for a given PC and then submit them to DAVID and then conclude "genes involved in metabolic pathways are responsible for the greatest variation in our data."

I think their logic is like this: In our data, we expect CNV status will produce most of the variation in expression data between samples. The first principal component should describe which genes are most involved in producing the most variation between samples. Therefore, the genes heavily weighted in the first principal component should be somehow related to the genes affected by CNV status.

What's missing/wrong from what I'm saying about principal components? How can I explain the appropriate interpretation of PCA to my lab? What alternative analyses could I apply if I want to make a claim about underlying patterns of gene expression that distinguish the three conditions?


RNA-Seq communication analysis pca • 16k views
8.4 years ago

Ah, you tried to explain the math. There's your problem...

Seriously, it may be that your example is too simple for your colleagues to grasp. You had one variable (tissue type) with three states, and used PCA to see how well the observed variation corresponded with those states. Presumably, it did, so it seems only logical that the genes most directly affected by the tissue type would be the ones weighted most heavily. In this case, it's hard to grasp the orthogonality.

Perhaps you can provide a more intuitive understanding of PCA with a thought experiment: two tissues (e.g., brain and muscle) with two treatments (fed and starved). Ask them to predict how many clusters PCA would yield (hopefully, they say four!). Then, ask them to describe the properties of a gene that would be most useful for discriminating those four clusters (i.e., which would be given the highest weight). It would be one that reflects both the tissue AND treatment. A different gene may be a much stronger predictor of one variable (e.g., myoglobin for tissue type) but unchanged for the other, and would therefore be assigned less weight. You can also have them predict which clusters would be most similar (tissue), to understand that the principal components describe how much each variable contributes to the total observed variation.
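If it helps, the thought experiment can be simulated. The sketch below is only an illustration under invented assumptions: 3 replicates per group, 200 genes, effect sizes of 3.0 (tissue) and 2.0 (treatment), and Gaussian noise instead of real counts. It checks which design factor each of the first two PCs tracks, and prints the average absolute PC1 weight for genes that respond to tissue only, to both factors, to treatment only, or to neither.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design: 2 tissues x 2 treatments x 3 replicates = 12 samples.
tissue = np.repeat([0, 0, 1, 1], 3)      # 0 = brain, 1 = muscle
treatment = np.repeat([0, 1, 0, 1], 3)   # 0 = fed,   1 = starved

n_genes = 200
tissue_effect = np.zeros(n_genes)
treatment_effect = np.zeros(n_genes)
tissue_effect[:50] = 3.0         # genes 0-49 respond to tissue
treatment_effect[40:90] = 2.0    # genes 40-89 respond to treatment
# Genes 40-49 therefore respond to both factors.

expr = (np.outer(tissue, tissue_effect)
        + np.outer(treatment, treatment_effect)
        + rng.normal(scale=0.5, size=(tissue.size, n_genes)))

centered = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt.T

# How strongly does each of the first two PCs track each design factor?
corr_pc1_tissue = abs(np.corrcoef(scores[:, 0], tissue)[0, 1])
corr_pc2_treatment = abs(np.corrcoef(scores[:, 1], treatment)[0, 1])
print(round(corr_pc1_tissue, 2), round(corr_pc2_treatment, 2))

# Mean absolute PC1 loading per gene class.
gene_classes = [("tissue only", slice(0, 40)),
                ("both", slice(40, 50)),
                ("treatment only", slice(50, 90)),
                ("neither", slice(90, n_genes))]
for name, idx in gene_classes:
    print(name, round(float(np.abs(vt[0, idx]).mean()), 3))
```

Plotting `scores[:, 0]` against `scores[:, 1]` and coloring by group should show the four clusters. Changing the two effect sizes is a quick way to let people see how the weights shift when one factor dominates.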

For the more mathematically inclined, Lior Pachter has a great blog post about PCA here.

8.4 years ago

I recently read a nice paper where the authors made very good use of PCA to analyze expression data. The paper is:

Fehrmann et al., Nat Genet 2015.
Gene Expression analysis identifies global gene dosage sensitivity in cancer. Available at

I even prepared a journal club on this. In order to explain the paper to my colleagues, I dedicated a few slides to explain how PCA can be interpreted biologically. You can check the slides here. (sorry if a few things are oversimplified there - I didn't have much time to explain)

In the paper, they take a large dataset of expression profiles from multiple sources, classify the tissue of each sample, and do a big PCA on all the samples. I am not sure I remember correctly, but at this stage they only use normal (non-disease) samples. Then, they interpret each PC as a "transcriptomic profile", i.e. a set of genes having a similar pattern of expression variation across all the samples analyzed. For example, they say that their 3rd component represents genes expressed in the brain. They also do a separate GSEA for every component, to identify which biological processes are enriched in each component. They also use the eigenvector coefficient as a "wiring" coefficient, i.e. how much the expression of a gene is expected to change in normal samples in the conditions described in each PC. This is quite similar to the interpretation you are proposing for your analysis.

In general I explain that PCA is a technique to reduce a dataset of many variables to fewer dimensions (a dimensionality reduction technique). You can use the example of a PCA on 3 dimensions, in which you rotate a cube (3 dimensions) to find the rotation in which the samples are most spread out on the X and Y axes (2 dimensions). If your biologists are still doubtful I can give you many examples to explain PCA, from the eugenicists using it to distinguish smart from stupid people depending on the characteristics of the skull (phrenology), to Cavalli-Sforza's analysis of population migrations.
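The cube picture can be sketched in a few lines of NumPy. Everything here is made up for illustration (a random 3-D point cloud, a random rotation, the variable names). The point is only that PCA rotates the cloud without stretching it, and that the first two rotated axes keep most of the spread.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 3-D cloud: wide along two directions, thin along the third.
points = rng.normal(size=(100, 3)) * np.array([5.0, 2.0, 0.3])

# Apply an arbitrary orthogonal transform so that no original axis lines up
# with the directions of largest spread.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
data = points @ q.T

centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

rotated = centered @ vt.T   # same cloud, viewed along its principal axes
flat = rotated[:, :2]       # the 2-D picture you would show as a PCA plot
explained = s**2 / np.sum(s**2)

# A rotation changes the viewpoint only: the total variance is unchanged.
total_before = centered.var(axis=0, ddof=1).sum()
total_after = rotated.var(axis=0, ddof=1).sum()

print(np.round(explained, 3))
print(bool(np.isclose(total_before, total_after)))
```

Dropping the third column of `rotated` is the only step that loses information, and `explained` tells you how much.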


Thank you for the ppt :)

6.2 years ago

Late in answering but I gave an explanation here: A: PCA in a RNA seq analysis


8.4 years ago
matted 7.8k

I don't think there's really anything wrong with how you're explaining principal components, but maybe some specific counterexamples would help change your labmates' minds.

Their interpretation of the principal component loadings is tempting, and in a perfect experimental world might even be true. A big counterexample to bring up is batch effects or other technical explanations. A frequent test in expression analysis (and other areas) is to plot the first few principal components of the data with the points colored by technical attributes, like sample processing date, machine platform, lab, or technician. Oftentimes there will be a clear clustering due to one of these factors (which isn't ideal and should be addressed). The loadings in this case would be something like genes that are able to be processed better on one instrument than another, maybe due to GC content (just to make up an example). Another related effect might occur due to the gender, cell type, or age of the samples (depending on the setup of the study). If gender is a big effect, then presumably the loadings for that component may contain a lot of sex-specific genes. These effects would be biologically meaningful, in a sense, but probably not what you're interested in studying.
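As a rough sketch of that check without a plot, you can ask how much of each PC's variance is explained by each sample attribute. The data below are simulated under assumptions I made up for the example (two processing runs, a large technical shift on 100 genes, a smaller condition shift on 30 genes), and `eta_squared` is a small helper written here, not a library function.

```python
import numpy as np

def eta_squared(values, labels):
    """Fraction of the variance in `values` explained by group membership."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    grand_mean = values.mean()
    ss_total = np.sum((values - grand_mean) ** 2)
    ss_between = sum(
        np.sum(labels == g) * (values[labels == g].mean() - grand_mean) ** 2
        for g in np.unique(labels)
    )
    return ss_between / ss_total

rng = np.random.default_rng(3)

# Hypothetical metadata: 3 conditions x 4 replicates, processed in 2 runs.
condition = np.repeat(["DEL", "DUP", "CTRL"], 4)
batch = np.tile(["run1", "run2"], 6)

n_genes = 300
expr = rng.normal(scale=0.5, size=(12, n_genes))
expr[batch == "run2", :100] += 3.0         # large technical shift
expr[condition == "DEL", 100:130] -= 1.5   # smaller biological shifts
expr[condition == "DUP", 100:130] += 1.5

centered = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt.T

for pc in range(3):
    print(f"PC{pc + 1}",
          "batch:", round(eta_squared(scores[:, pc], batch), 2),
          "condition:", round(eta_squared(scores[:, pc], condition), 2))
```

In a setup like this the first component follows the processing run rather than the CNV status, so its top-weighted genes would describe the technical artifact, not the biology.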

The (good and bad) thing about PCA is that it discards any labels on the data (in your case, CNV-DEL, CNV-DUP, etc.). All the data points are treated the same, and the reconstruction error from the principal components doesn't depend on the label at all. Instead of an unsupervised technique like PCA, you might consider various supervised methods that use the sample labels to find groups of genes that significantly differ between classes. Standard differential expression analyses would be a good place to start here.
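To illustrate the supervised idea, here is a minimal sketch that computes a per-gene one-way ANOVA F statistic across the three classes. It is a stand-in only: the data are simulated Gaussian values with an invented effect on the first 20 genes, and `one_way_f` is a helper written for this example. For real RNA-Seq counts you would use a dedicated differential expression package (for example DESeq2, edgeR, or limma) rather than this.

```python
import numpy as np

def one_way_f(expr, labels):
    """Per-gene one-way ANOVA F statistic for a samples x genes matrix."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n, k = expr.shape[0], groups.size
    grand_mean = expr.mean(axis=0)
    ss_between = np.zeros(expr.shape[1])
    ss_within = np.zeros(expr.shape[1])
    for g in groups:
        block = expr[labels == g]
        group_mean = block.mean(axis=0)
        ss_between += block.shape[0] * (group_mean - grand_mean) ** 2
        ss_within += np.sum((block - group_mean) ** 2, axis=0)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

rng = np.random.default_rng(4)

# Hypothetical data: 5 samples per class, 500 genes.
labels = np.repeat(["CNV-DEL", "CNV-DUP", "CNV-CTRL"], 5)
expr = rng.normal(size=(15, 500))
expr[labels == "CNV-DEL", :20] -= 3.0   # made-up dosage-sensitive genes
expr[labels == "CNV-DUP", :20] += 3.0

# Rank genes by how strongly they differ between the labeled classes.
f_stat = one_way_f(expr, labels)
top = np.argsort(f_stat)[::-1][:20]
print(sorted(top.tolist()))

# Unlike PCA, the result depends on the labels: shuffle them and it changes.
shuffled = rng.permutation(labels)
f_shuffled = one_way_f(expr, shuffled)
print(round(float(np.median(f_stat[:20])), 1),
      round(float(np.median(f_shuffled[:20])), 1))
```

The PCA of `expr` would be identical with the true or the shuffled labels, because the labels never enter the calculation. The F statistics are not identical, which is exactly what makes them usable for a claim about genes that distinguish the conditions.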

Overall, I think PCA as a first-pass sanity check where you hope your sample classes cluster separately is a reasonable thing to do. And if in fact you have batch effects to consider, it's a good investigative tool. However, I agree with you in being hesitant to assign a lot of meaning to the particular values of the principal component loadings, since there are a lot of uninteresting scenarios that you can't exclude.

