PCA in a RNA seq analysis
2
13
Entering edit mode
4.7 years ago
Mozart ▴ 320

Hi everyone, I am trying to understand a bit deeper what actually a PCA plot means in a more practical way. So please don't link me any kind of explanation...I am just trying to make myself clear on a personal exercise.

RNA-Seq pca • 24k views
2
Entering edit mode

Cold you rephrase your question please ? It is not clear what is the aim of your question..

0
Entering edit mode

Samples are not segregating according to the defined condition (A/B). MF1_S1 seems to be very different than whatever is going on with the rest of the bunch.

0
Entering edit mode
61
Entering edit mode
4.7 years ago

Hey, that is not a 'PCA plot' - it's a bi-plot comparing eigenvector (PC) 1 versus eigenvector (PC) 2. There will be more eigenvectors in your dataset at which you should additionally look. Each eigenvector will additionally have an associated eigenvalue, which alludes to its 'importance' (see below).

# Intro to PCA

PCA is a very powerful technique and can have much utility if you can get a grasp of it and what it means. It was initially developed to analyse large volumes of data in order to tease out the differences/relationships between the logical entities being analysed (for example, a data-set consisting of a large number of samples, each with their own data points/variables). It extracts the fundamental structure of the data without the need to build any model to represent it. This ‘summary’ of the data is arrived at through a process of reduction that can transform the large number of variables into a lesser number that are uncorrelated (i.e. the ‘principal components'), whilst at the same time being capable of easy interpretation on the original data.

[Source: my own manuscript: https://benthamopen.com/contents/pdf/TOBIOIJ/TOBIOIJ-7-19.pdf]

# The formulae and variance

The formulae to derive the eigenvectors and their associated eigenvalues are fundamentally based on variance (well, covariance). Thus, what PCA is summarising in your dataset is variance, or, better put, how your entities covary amongst each other. The eigenvectors are then ordered based on how much variation they explain in your dataset. PC1 / eigenvector 1 will always explain the most variation due to this ordering of the PCs. Thus, as to which genomax has correctly pointed in his comment above, the largest source of variation in your dataset is between MF1_S1 and the other samples, for whatever reasons we are not to know.

The variation explained by each eigenvector/PC is represented by a Scree plot. The explained variation of all PCs will sum to 100% - PCA will extract every ounce of variation that exists in your dataset

# What does this mean practically?

Practically, if you look at the numbers behind your PC1, you'll first notice that each gene/transcript/variable has been assigned a value... a weighting that allows us to infer its importance in relation to PC1, and, thus, its importance in relation to the source of variation between MF1_S1 and your other samples.

In your example, it looks like you're using DESeq2's in built function to build the bi-plot. Don't use that. Instead use the prcomp() function in R and then take a look at your eigenvectors, which will be stored in a variable called 'x', e.g., prcomp(MyData)$x # PCA is multi-dimensional Remember that PCA is much more than just this bi-plot that you've posted. PCA is multidimensional and, as mentioned, will extract every ounce of variation that exists in your dataset, which can be visualised by pairwise comparisons of each PC... The 'key' that you may be seeking could be hidden in these other PCs, but this depends greatly on your experimental set-up and what you are ultimately hoping to achieve by running whatever experiment it is that you're running. That's PCA explained to the general audience. ADD COMMENT 3 Entering edit mode Finally I found someone that we do not plot a proper PCA in RNASeq data as we see, mathematically it is multi-dimensional and what we do is a bi-plot to just see the components that capture the highest variability based on eigen-values. ADD REPLY 0 Entering edit mode Thanks for the comment chris! - grazie mille ADD REPLY 3 Entering edit mode If you look at the pairs plots, you will notice that there it seems to be no correlation between any of the components. That in fact is an important property of the components - there is no correlation between them. If you have a dataset containing multiple variables and some of these are correlated (ideally 0.8>r>0.3), you can use a PCA to reduce your data to eigenvectors that are not correlated with each other, which will allow you to use regression and other techniques that are confused by data where the variables are not independent from each other. ADD REPLY 0 Entering edit mode Good points, Giovanni! ADD REPLY 0 Entering edit mode Hi Kevin, Just a question about the PCA, So I have estimated my count with Kallisto, then I have transformed it to "scaledTPM" values, which are obtained by summing the transcript-level TPMs by gene, and multiplying them with the total library size in millions. (See this for more information Soneson C, Love M and Robinson M. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.) I have then used the package edgeR and normalize this "scaledTPM" values (TMM + cpm): y <- DGEList(scaledTPM) y <- calcNormFactors(y) my_data <- cpm(y, log=T)  Now for the PCA should I use the option scale or no ? pca <- prcomp(t( my_data), scale = T)  or pca <- prcomp(t( my_data), scale = F)  ADD REPLY 0 Entering edit mode Considering that you've logged the counts (?), too, my preference would be to leave prcomp at the default settings: prcomp(t(filteredExpr), center=TRUE, scale=FALSE)  If you were plotting just normalised unlogged counts, then scaling would be preferential! Edit: I have another thread here with more code: A: PCA plot from read count matrix from RNA-Seq ADD REPLY 0 Entering edit mode From what I understand, cpm() function returns counts per million reads. However, because I have used scaledTPMs instead of real counts, this is equivalent to the initial TPMs used to create the scaledTPMs. But these are additionally normalized by dividing each samples values by the TMM normalization factor ADD REPLY 2 Entering edit mode Just to clarify, TPM should not be used for any statistical analysis for DE. Rather you can use it in log-transformed for visualization purpose in your PCA. Your TPM is already a normalized data and you do not have to re-calculate the CPM from that. I would still use counts data, normalize it with TMM, plot the log-transformed CPM value for any outlier detection or understanding how my samples are distributed in orthogonal space if I make a bi-plot of 1st to components. TPM values should not be pushed into any analysis later on in the model. ADD REPLY 1 Entering edit mode Gets my upvote vchris - thanks ADD REPLY 0 Entering edit mode Ok but using the counts is not a problem ? because I have kallisto counts (which are not integers). ADD REPLY 0 Entering edit mode If you have used Kallisto then you also have the raw 'estimated' counts, which you can read in using manual scripts or, preferably, tximport ADD REPLY 0 Entering edit mode Thanks both for your answer. So if I understand: From kallisto results, I can do a PCA and DE analysis using these data: library(edgeR)  1) Import the count: txi <- tximport(files, type = "kallisto", tx2gene = transcript2gene) y <- DGEList(txi$counts)


2) TMM normalisation + log:

y <- calcNormFactors(y)
my_data <- cpm(y, log=T)


3) PCA (without scale):

pca <- prcomp(t(my_data), scale = F)

0
Entering edit mode

I don't see any link between txi and your other lines of code(?).

As vchris was saying, differetial expression anaysis is normally conducted on normalised counts, with plotting/visualisation purposes performed on logged or scaled normalised counts.

In DESeq2, I normally do it like this:

I am only familiar with Kallisto and DESeq2:

txi <- tximport(files, type="kallisto", txIn=TRUE, txOut=FALSE, tx2gene=TranscriptToGeneMap, countsFromAbundance="no")

#Filtering
txi.working$abundance <- txi.working$abundance[TranscriptFilterIndices,]
txi.working$counts <- txi.working$counts[TranscriptFilterIndices,]
txi.working$length <- txi.working$length[TranscriptFilterIndices,]

#Convert the data to DESeq format and specify the model
dds <- DESeqDataSetFromTximport(txi.working, colData=metadata.final, design =~ Batch + GENDER)

dds <- DESeq(dds)

#Normalise the data and output the normalised counts to disk
norm <- counts(dds, normalized=TRUE)

#Produce rlog-transformed and variance stabilised data, and conduct some diagnostics
rld <- rlogTransformation(dds, blind=TRUE)


Later, diff. expression is conducted using results() on the dds object, containing normalised counts. For plotting (and PCA), I use the regularised log counts in rld.

0
Entering edit mode

Sorry I forget the txi objet (I have edited it).

What is the normalization method in

norm <- counts(dds, normalized=TRUE)


?

2
Entering edit mode

DESeq2 uses a 'geometric' normalisation, which produces a negative binomial normalised count distrbution. EdgeR uses a different method.

I happened to just be reading specifically about the DESeq2 method whilst correcting a PhD student's thesis. https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-10-r106

0
Entering edit mode

Looks okay for doing PCA

0
Entering edit mode

https://benthamopen.com/contents/pdf/TOBIOIJ/TOBIOIJ-7-19.pdf] in this paper you explained which component to retain this is something i asked and you answered may be i could have just see this paper first ..

1
Entering edit mode

Yes, I convinced that journal to publish that for free... I published it from the English health system, which typically does not allocate money for research.

0
Entering edit mode

Dear @Kevin Is it possible to find the genes that best characterizing each group using PCA? I have 630 genes and 300 samples that are annotated in 4 cancer subtypes, using PCA, 4 groups are clear. I want to find genes that best characterize each subtype. Using PC1 loadings I could find the genes that have highest scores but how I can find which subtypes they belong to? Many thanks for your time and help!

0
Entering edit mode
19
Entering edit mode
3.6 years ago

(the following is a dramatization of the reality - but is generally true. You can think of it as the legend of the PCA)

For many centuries, the scientific community believed in what is written in the bible, which is that all humans are created equals, and there are no differences between people.

During the Enlightenment in the 19th century, thanks also to Darwin's book and other advances, people started thinking that maybe that was not the case. My brother is taller than me! Andrew's cranium is bigger than Louis'! Justin is mean! and so on.

Scientist started compiling lists of observations about human variability - e.g. cranium size, weight, height, and many others - this was how sciences like phrenology were born (http://www.victorianweb.org/science/phrenology/intro.html).

So, imagine these scientists with big tables of observations, like this:

subject height weight cranium_size eyebrow_type ...
1       170    69     40           single
2        175   89     43           double
….


However, if you think about it, working with this data presents some challenges.

First of all, there are many variables, so many dimensions. How can you even plot these? You may create a scatterplot for each combination of features, but you will end up with hundreds of plots. I am sure you may have been in a similar situation even if you are not a phrenologist: whenever you had to analyse a dataset with an high number of variables.

The second problem is that some of these features are correlated to each other. For example, height and weight may be correlated, as taller people tend to weight more. Many physical traits tend to be correlated. When your data contains correlated columns, you cannot really do regression or other types of analyses, because you are accounting for the same information twice.

So, the PCA was a technique developed for solving these issues. In a PCA, you take a dataset with an high number of variables, and you reduce it to two or a small number of variables (more precisely these are called components). Each of these new components will contain the information from all the original variables, but “flattened” in a way that each of these represents a portion of the variability in the original values. Moreover, these components are un-related with each other, which allows to apply regression and other techniques.

For each of these components, you will have a vector of “loadings” which tell you how much each of the original variable is represented in the component. For example, the first component may have high loadings for height and weight, which are two partially correlated variables, so they tend to contribute to variability in a similar way. You may even try to give a new meaning to the new components, e.g. you could say that this first component is the “biggerness” (a composite of height and weight) of a person. However, it is not always easy to understand the real meaning of a component.

I’ll leave you to other people to point out the mathematics of how to calculate a PCA, because there are plenty of articles out there; but in principle, this is what you are doing: you are reducing a dataset of several dimensions to a smaller number of components.

EDIT - I forgot to explain how this relates to PCA in RNASeq data

In a RNASeq experiment, you can do a PCA on the expression levels of each gene. Instead of having a table with subject id, height, weight, and so on, you will have a table where each column is the expression of a gene, and each row represents a sample. You may expect that some genes have correlated expression. In that case, these genes will probably have similar loadings in the same component.

I like this paper: https://www.nature.com/articles/ng.3173/ , where they have applied PCA to a large number of expression datasets from public sources, and did a PCA on them. They have obtained a few hundreds Principal Components for each dataset (e.g 777 components for the large human dataset). Each of these components has loadings for each of the genes in the original data. They have taken these vectors of loadings, and executed an pathway enrichment on each. The purpose of this enrichment is to try to describe what each of these components mean; for example, the third component represents the status of gene expression in the brain, because many genes known to be in pathways related to brain development have high loadings in that component.

2
Entering edit mode

1
Entering edit mode

Thanks to both of you guys for the effort, very helpful indeed!

3
Entering edit mode

...and here is the proof of the non-correlated nature of the PCs:

The PCs have absolutely no correlation to each other, i.e., Pearson = 0.

1
Entering edit mode

the thread is getting better and better in context of PCA