Hey, that is not a 'PCA plot' - it's a bi-plot comparing eigenvector (PC) 1 versus eigenvector (PC) 2. Your dataset will contain more eigenvectors than these two, and you should look at those as well. Each eigenvector also has an associated eigenvalue, which indicates its 'importance' (see below).
Intro to PCA
PCA is a very powerful technique and can have much utility if you can get a grasp of it and what it means. It was initially developed to analyse large volumes of data in order to tease out the differences/relationships between the logical entities being analysed (for example, a data-set consisting of a large number of samples, each with their own data points/variables). It extracts the fundamental structure of the data without the need to build any model to represent it. This ‘summary’ of the data is arrived at through a process of reduction that can transform the large number of variables into a lesser number that are uncorrelated (i.e. the ‘principal components'), whilst at the same time being capable of easy interpretation on the original data.
[Source: my own crumby manuscript: https://benthamopen.com/contents/pdf/TOBIOIJ/TOBIOIJ-7-19.pdf]
The formulae and variance
The formulae to derive the eigenvectors and their associated eigenvalues are fundamentally based on variance (well, covariance). Thus, what PCA is summarising in your dataset is variance, or, better put, how your entities covary amongst each other. The eigenvectors are then ordered based on how much variation they explain in your dataset. PC1 / eigenvector 1 will always explain the most variation due to this ordering of the PCs. Thus, as genomax has correctly pointed out in his comment above, the largest source of variation in your dataset is between MF1_S1 and the other samples, for reasons that we cannot know from the plot alone.
The variation explained by each eigenvector/PC is represented by a scree plot. The explained variation of all PCs will sum to 100% - PCA will extract every ounce of variation that exists in your dataset.
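As a rough sketch of where those percentages come from: the eigenvalues are the squared standard deviations that prcomp() returns, and dividing each by their total gives the proportion of variation explained. The matrix below is simulated purely for illustration, and the object names are my own.

```r
# Simulated data: 6 samples (rows) x 4 variables (columns); purely illustrative
set.seed(1)
mat <- matrix(rnorm(24), nrow = 6, ncol = 4)

pca <- prcomp(mat)

# Eigenvalues are the squared standard deviations of the PCs
eigenvalues <- pca$sdev^2

# Percentage of the total variation explained by each PC
pct_explained <- 100 * eigenvalues / sum(eigenvalues)

# Scree plot: explained variation per PC, in decreasing order
barplot(pct_explained,
        names.arg = paste0("PC", seq_along(pct_explained)),
        ylab = "% variation explained")
```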
What does this mean practically?
Practically, if you look at the numbers behind your PC1, you'll first notice that each gene/transcript/variable has been assigned a value... a weighting that allows us to infer its importance in relation to PC1, and, thus, its importance in relation to the source of variation between MF1_S1 and your other samples.
In your example, it looks like you're using DESeq2's built-in function to build the bi-plot. Don't use that. Instead, use the prcomp() function in R and then take a look at your eigenvectors, which will be stored in the 'rotation' element of the returned object (the coordinates of your samples along each PC are stored in 'x').
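A minimal sketch of what that could look like, assuming a normalised, log-scale expression matrix with genes in rows and samples in columns. The `expr` matrix here is simulated and every sample name other than MF1_S1 is a placeholder, so swap in your own object.

```r
# Stand-in for a normalised, log-scale expression matrix (genes x samples).
# With DESeq2 this might come from something like assay(vst(dds)), where
# 'dds' is your own object; here the values are simulated.
set.seed(42)
expr <- matrix(rnorm(100 * 4), nrow = 100,
               dimnames = list(paste0("gene", 1:100),
                               c("MF1_S1", "sampleB", "sampleC", "sampleD")))

# prcomp() expects samples in rows and variables (genes) in columns
pca <- prcomp(t(expr))

# Sample coordinates along each PC (this is what gets plotted)
head(pca$x)

# Eigenvectors: one weighting per gene for each PC
head(pca$rotation)

# Genes with the largest absolute weighting on PC1
top_pc1 <- names(sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE))[1:10]
```

Whether to also scale each gene (scale. = TRUE) depends on your data and is a judgment call that this sketch does not try to settle.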
PCA is multi-dimensional
Remember that PCA is much more than just this bi-plot that you've posted. PCA is multidimensional and, as mentioned, will extract every ounce of variation that exists in your dataset, which can be visualised by pairwise comparisons of each PC...
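One way to get those pairwise comparisons, sketched on simulated data (the matrix and the choice of showing four PCs are arbitrary assumptions on my part), is a scatterplot matrix of the PC coordinates:

```r
# Simulated data: 8 samples (rows) x 5 variables (columns); purely illustrative
set.seed(7)
mat <- matrix(rnorm(40), nrow = 8, ncol = 5)
pca <- prcomp(mat)

# Scatterplot matrix of the first few PCs against each other
n_show <- min(4, ncol(pca$x))
pairs(pca$x[, seq_len(n_show)], main = "Pairwise PC comparisons")
```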
The 'key' that you may be seeking could be hidden in these other PCs, but this depends greatly on your experimental set-up and what you are ultimately hoping to achieve by running whatever experiment it is that you're running.
That's PCA explained to the general audience.
(the following is a dramatization of the reality - but is generally true. You can think of it as the legend of the PCA)
For many centuries, the scientific community believed in what is written in the bible, which is that all humans are created equal, and there are no differences between people.
In the 19th century, building on the Enlightenment and thanks also to Darwin's book and other advances, people started thinking that maybe that was not the case. My brother is taller than me! Andrew's cranium is bigger than Louis'! Justin is mean! And so on.
Scientists started compiling lists of observations about human variability - e.g. cranium size, weight, height, and many others - and this was how sciences like phrenology were born (http://www.victorianweb.org/science/phrenology/intro.html).
So, imagine these scientists with big tables of observations, like this:
subject  height  weight  cranium_size  eyebrow_type  ...
1        170     69      40            single
2        175     89      43            double
...
However, if you think about it, working with this data presents some challenges.
First of all, there are many variables, and therefore many dimensions. How can you even plot these? You may create a scatterplot for each combination of features, but you will end up with hundreds of plots. I am sure you have been in a similar situation even if you are not a phrenologist: whenever you had to analyse a dataset with a high number of variables.
The second problem is that some of these features are correlated with each other. For example, height and weight may be correlated, as taller people tend to weigh more. Many physical traits tend to be correlated. When your data contains correlated columns, regression and other types of analyses become unreliable, because you are effectively accounting for the same information twice.
So, PCA is a technique developed for solving these issues. In a PCA, you take a dataset with a high number of variables, and you reduce it to two or a small number of new variables (more precisely, these are called components). Each of these new components will contain information from all the original variables, but “flattened” in such a way that each represents a portion of the variability in the original values. Moreover, these components are uncorrelated with each other, which allows you to apply regression and other techniques.
For each of these components, you will have a vector of “loadings”, which tell you how much each of the original variables is represented in the component. For example, the first component may have high loadings for height and weight, which are two partially correlated variables, so they tend to contribute to variability in a similar way. You may even try to give a meaning to the new components, e.g. you could say that this first component is the “biggerness” (a composite of height and weight) of a person. However, it is not always easy to understand the real meaning of a component.
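To make the loadings idea a bit more concrete, here is a small sketch in R using prcomp(). The measurements are simulated (the numbers and the relationship between height and weight are invented for the example), so read it as an illustration of the idea rather than real phrenology data.

```r
# Simulated measurements for 50 subjects; the numbers are made up
set.seed(3)
height <- rnorm(50, mean = 170, sd = 10)
weight <- 0.9 * height - 85 + rnorm(50, sd = 5)  # partially correlated with height
cranium_size <- rnorm(50, mean = 42, sd = 2)     # generated independently of the other two
subjects <- data.frame(height, weight, cranium_size)

# scale. = TRUE puts variables measured in different units on an equal footing
pca <- prcomp(subjects, scale. = TRUE)

# Loadings: how much each original variable is represented in each component
pca$rotation

# The components are uncorrelated with each other (off-diagonals are ~0)
round(cor(pca$x), 10)
```

In this set-up you would expect height and weight to get large loadings of the same sign on the first component, which is the “biggerness” described above.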
I’ll leave it to other people to explain the mathematics of how to calculate a PCA, because there are plenty of articles out there; but in principle, this is what you are doing: you are reducing a dataset of several dimensions to a smaller number of components.
EDIT - I forgot to explain how this relates to PCA in RNASeq data
In an RNASeq experiment, you can do a PCA on the expression levels of each gene. Instead of having a table with subject id, height, weight, and so on, you will have a table where each column is the expression of a gene, and each row represents a sample. You may expect that some genes have correlated expression. In that case, these genes will probably have similar loadings in the same component.
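A sketch of that layout in R, with samples in rows and genes in columns. The expression values are simulated, and the "shared signal" that drives gene1 and gene2 is a hypothetical device to make two genes co-expressed, not something taken from a real experiment.

```r
# Simulated log-expression: 20 samples (rows) x 10 genes (columns); values are made up
set.seed(11)
n_samples <- 20
shared <- rnorm(n_samples, sd = 3)  # hypothetical common regulatory signal
expr <- matrix(rnorm(n_samples * 10), nrow = n_samples,
               dimnames = list(paste0("sample", 1:n_samples),
                               paste0("gene", 1:10)))

# gene1 and gene2 both follow the shared signal, so their expression is correlated
expr[, "gene1"] <- shared + rnorm(n_samples, sd = 0.5)
expr[, "gene2"] <- shared + rnorm(n_samples, sd = 0.5)

pca <- prcomp(expr)

# Loadings of every gene on the first component
round(pca$rotation[, "PC1"], 2)
```

Because gene1 and gene2 carry most of the variation here, they should dominate the first component with loadings of the same sign, while the remaining genes stay close to zero.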
I like this paper: https://www.nature.com/articles/ng.3173/ , where they collected a large number of expression datasets from public sources and did a PCA on them. They obtained a few hundred Principal Components for each dataset (e.g. 777 components for the large human dataset). Each of these components has loadings for each of the genes in the original data. They took these vectors of loadings and ran a pathway enrichment analysis on each. The purpose of this enrichment is to try to describe what each of these components means; for example, the third component represents the status of gene expression in the brain, because many genes known to be in pathways related to brain development have high loadings in that component.