Question: PCA in an RNA-seq analysis

Mozart wrote, 13 months ago:

Hi everyone, I am trying to understand a bit more deeply what a PCA plot actually means in a practical sense. So please don't just link me to a generic explanation... I am trying to make this clear to myself as a personal exercise.

rna-seq pca • 5.5k views
modified 23 days ago by Giovanni M Dall'Olio • written 13 months ago by Mozart
Could you rephrase your question, please? It is not clear what the aim of your question is.

written 13 months ago by Nicolas Rosewick

Samples are not segregating according to the defined condition (A/B). MF1_S1 seems to be very different from whatever is going on with the rest of the bunch.

written 13 months ago by genomax
Kevin Blighe (Republic of Ireland) wrote, 13 months ago:

Your terminology is incorrect. That is not a 'PCA plot' - it's a bi-plot comparing eigenvector (PC) 1 versus eigenvector (PC) 2. There will be more eigenvectors in your dataset, which you should also look at. Each eigenvector will also have an associated eigenvalue, which alludes to its 'importance' (see below).

Intro to PCA

PCA is a very powerful technique and can have much utility if you can get a grasp of it and what it means. It was initially developed to analyse large volumes of data in order to tease out the differences/relationships between the logical entities being analysed (for example, a data-set consisting of a large number of samples, each with their own data points/variables). It extracts the fundamental structure of the data without the need to build any model to represent it. This ‘summary’ of the data is arrived at through a process of reduction that can transform the large number of variables into a lesser number that are uncorrelated (i.e. the ‘principal components'), whilst at the same time being capable of easy interpretation on the original data.

[Source: my own crumby manuscript: https://benthamopen.com/contents/pdf/TOBIOIJ/TOBIOIJ-7-19.pdf]

The formulae and variance

The formulae used to derive the eigenvectors and their associated eigenvalues are fundamentally based on variance (well, covariance). Thus, what PCA is summarising in your dataset is variance, or, better put, how your entities covary amongst each other. The eigenvectors are then ordered based on how much variation they explain in your dataset; PC1 / eigenvector 1 will always explain the most variation due to this ordering of the PCs. Thus, as genomax has correctly pointed out in his comment above, the largest source of variation in your dataset is between MF1_S1 and the other samples, for reasons we cannot know from here.

The variation explained by each eigenvector/PC is represented by a scree plot. The explained variation of all PCs will sum to 100% - PCA will extract every ounce of variation that exists in your dataset.

[Scree plot: proportion of variation explained by each PC]
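
To put numbers on that, here is a minimal sketch in base R (assuming a numeric matrix 'MyData' with samples in rows and genes/variables in columns; the object name is purely illustrative):

# PCA on the hypothetical matrix 'MyData'
pca <- prcomp(MyData, center=TRUE, scale.=FALSE)

# the eigenvalues are the squared standard deviations of the PCs
var_explained <- pca$sdev^2 / sum(pca$sdev^2) * 100

# the proportions of explained variation sum to 100%
sum(var_explained)

# a simple scree plot
barplot(var_explained, names.arg=paste0("PC", seq_along(var_explained)), ylab="Variance explained (%)")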

What does this mean practically?

Practically, if you look at the numbers behind your PC1, you'll first notice that each gene/transcript/variable has been assigned a value - a weighting (loading) that allows us to infer its importance in relation to PC1 and, thus, its importance in relation to the source of variation between MF1_S1 and your other samples.

In your example, it looks like you're using DESeq2's built-in function to draw the bi-plot. Don't use that. Instead, use the prcomp() function in R and then take a look at the output: the per-sample component scores are stored in the element 'x' (e.g. prcomp(MyData)$x) and the per-gene loadings/eigenvectors in 'rotation' (e.g. prcomp(MyData)$rotation).
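
As a rough sketch (assuming 'logcounts' is a matrix of log-transformed normalised counts with genes in rows and samples in columns - swap in your own object name):

pca <- prcomp(t(logcounts), center=TRUE, scale.=FALSE)

# per-sample scores: one row per sample, one column per PC
head(pca$x[, 1:2])

# per-gene loadings for PC1, ranked by absolute weight; the top genes are
# those contributing most to the separation along PC1
head(sort(abs(pca$rotation[, "PC1"]), decreasing=TRUE), 20)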

PCA is multi-dimensional

Remember that PCA is much more than just this bi-plot that you've posted. PCA is multidimensional and, as mentioned, will extract every ounce of variation that exists in your dataset, which can be visualised by pairwise comparisons of each PC...

[Pairs plots: pairwise scatterplots of the principal components]
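
Something along these lines will reproduce them (again just a sketch, continuing from the 'pca' object above):

# pairwise scatterplots of the first five PCs
pairs(pca$x[, 1:5], pch=19, main="Pairwise PC comparisons")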

The 'key' that you may be seeking could be hidden in these other PCs, but this depends greatly on your experimental set-up and what you are ultimately hoping to achieve by running whatever experiment it is that you're running.


That's PCA explained for a general audience.

modified 28 days ago • written 13 months ago by Kevin Blighe
Finally, someone who points out that what we plot for RNA-seq data is not a 'proper', complete PCA: mathematically it is multi-dimensional, and what we draw is a bi-plot of just the components that capture the highest variability, ranked by their eigenvalues.

modified 13 months ago • written 13 months ago by vchris_ngs

Thanks for the comment, chris - grazie mille (many thanks)!

written 13 months ago by Kevin Blighe
If you look at the pairs plots, you will notice that there seems to be no correlation between any of the components. That is in fact an important property of the components: there is no correlation between them. If you have a dataset containing multiple variables and some of these are correlated (ideally 0.3 < r < 0.8), you can use a PCA to reduce your data to eigenvectors that are not correlated with each other, which will allow you to use regression and other techniques that are confused by data where the variables are not independent of each other.
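
A quick, self-contained illustration in R (random data, purely for demonstration):

set.seed(1)
mat <- matrix(rnorm(200), nrow=20)   # 20 'samples' x 10 'variables'
pca <- prcomp(mat, center=TRUE, scale.=FALSE)

# off-diagonal Pearson correlations between the PCs are numerically zero
round(cor(pca$x), digits=8)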

written 23 days ago by Giovanni M Dall'Olio

Good points, Giovanni!

written 23 days ago by Kevin Blighe

Hi Kevin,

Just a question about the PCA,

So I have estimated my counts with Kallisto, then transformed them to "scaledTPM" values, which are obtained by summing the transcript-level TPMs by gene and multiplying them by the total library size in millions.

(See Soneson C, Love M and Robinson M, "Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences", for more information.)

I have then used the edgeR package to normalise these "scaledTPM" values (TMM + cpm):

library(edgeR)

y <- DGEList(scaledTPM)        # build a DGEList from the scaledTPM matrix
y <- calcNormFactors(y)        # TMM normalisation factors
my_data <- cpm(y, log=T)       # log2 CPM values

Now, for the PCA, should I use the scale option or not?

pca <- prcomp(t( my_data), scale = T)

or

pca <- prcomp(t( my_data), scale = F)
modified 13 months ago • written 13 months ago by Picasa

Considering that you've also logged the counts (?), my preference would be to leave prcomp at the default settings:

# note: the formal argument name in prcomp() is 'scale.'
prcomp(t(filteredExpr), center=TRUE, scale.=FALSE)

If you were plotting just normalised, unlogged counts, then scaling would be preferable!

Edit: I have another thread here with more code: A: PCA plot from read count matrix from RNA-Seq

modified 13 months ago • written 13 months ago by Kevin Blighe

From what I understand, the cpm() function returns counts per million reads.

However, because I have used scaledTPMs instead of real counts, this is equivalent to the initial TPMs used to create the scaledTPMs, except that these are additionally normalised by dividing each sample's values by its TMM normalisation factor.

written 13 months ago by Picasa
Just to clarify, TPM should not be used for any statistical analysis for DE. Rather, you can use it log-transformed for visualisation purposes in your PCA. Your TPM is already normalised data and you do not have to re-calculate the CPM from it. I would still use the counts data, normalise it with TMM, and plot the log-transformed CPM values for outlier detection, or for understanding how my samples are distributed in orthogonal space if I make a bi-plot of the first two components. TPM values should not be pushed into any analysis later on in the model.

written 13 months ago by vchris_ngs

Gets my upvote vchris - thanks

modified 13 months ago • written 13 months ago by Kevin Blighe

OK, but is using the counts not a problem? Because I have Kallisto counts (which are not integers).

written 13 months ago by Picasa

If you have used Kallisto, then you also have the raw 'estimated' counts, which you can read in using manual scripts or, preferably, tximport.

modified 13 months ago • written 13 months ago by Kevin Blighe

Thanks to both of you for your answers. So, if I understand correctly:

From kallisto results, I can do a PCA and DE analysis using these data:

library(tximport)
library(edgeR)

1) Import the counts:

txi <- tximport(files, type = "kallisto", tx2gene = transcript2gene)
y <- DGEList(txi$counts)

2) TMM normalisation + log:

y <- calcNormFactors(y)
my_data <- cpm(y, log=T)

3) PCA (without scale):

pca <- prcomp(t(my_data), scale = F)
modified 13 months ago • written 13 months ago by Picasa

I don't see any link between txi and your other lines of code(?).

As vchris was saying, differential expression analysis is normally conducted on normalised counts, with plotting/visualisation performed on logged or scaled normalised counts.

I am only familiar with Kallisto and DESeq2. In DESeq2, I normally do it like this:

txi.working <- tximport(files, type="kallisto", txIn=TRUE, txOut=FALSE, tx2gene=TranscriptToGeneMap, countsFromAbundance="no")

#Filtering
txi.working$abundance <- txi.working$abundance[TranscriptFilterIndices,]
txi.working$counts <- txi.working$counts[TranscriptFilterIndices,]
txi.working$length <- txi.working$length[TranscriptFilterIndices,]

#Convert the data to DESeq format and specify the model
dds <- DESeqDataSetFromTximport(txi.working, colData=metadata.final, design =~ Batch + GENDER)

dds <- DESeq(dds)

#Normalise the data and output the normalised counts to disk
norm <- counts(dds, normalized=TRUE)

#Produce rlog-transformed and variance stabilised data, and conduct some diagnostics
rld <- rlogTransformation(dds, blind=TRUE)

Later, diff. expression is conducted using results() on the dds object, containing normalised counts. For plotting (and PCA), I use the regularised log counts in rld.
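
A sketch of that last step (continuing from the 'rld' object above; 'GENDER' is simply one of the columns used in the design here):

# matrix of rlog values, genes x samples
rlog_mat <- assay(rld)

# manual PCA on the transposed matrix
pca <- prcomp(t(rlog_mat), center=TRUE, scale.=FALSE)
head(pca$x[, 1:2])

# DESeq2 also ships a quick built-in bi-plot, if you just want a fast look
plotPCA(rld, intgroup="GENDER")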

written 13 months ago by Kevin Blighe

Sorry, I forgot the txi object (I have edited it).

What is the normalization method in

norm <- counts(dds, normalized=TRUE)

?

modified 13 months ago • written 13 months ago by Picasa
DESeq2 uses a 'geometric' normalisation: each gene's counts are divided by that gene's geometric mean across samples, and the per-sample median of those ratios becomes the size factor. The normalised counts are then modelled with a negative binomial distribution. edgeR uses a different normalisation method (TMM).

I happened to just be reading specifically about the DESeq2 method whilst correcting a PhD student's thesis: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-10-r106
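
To make the 'geometric' idea concrete, here is a minimal sketch of the median-of-ratios size factors (not DESeq2's exact internal code; 'counts_mat' is a hypothetical genes x samples matrix of raw counts):

median_of_ratios <- function(counts_mat) {
  log_geo_means <- rowMeans(log(counts_mat))   # log geometric mean per gene
  keep <- is.finite(log_geo_means)             # drop genes with any zero count
  apply(counts_mat, 2, function(cnts) {
    exp(median(log(cnts[keep]) - log_geo_means[keep]))
  })
}

# normalised counts: divide each sample's counts by its size factor
# norm_counts <- sweep(counts_mat, 2, median_of_ratios(counts_mat), "/")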

written 13 months ago by Kevin Blighe

Looks okay for doing PCA.

written 13 months ago by Kevin Blighe

In this paper (https://benthamopen.com/contents/pdf/TOBIOIJ/TOBIOIJ-7-19.pdf) you explained which components to retain. That is something I asked and you answered; maybe I could have just read this paper first.

written 5 months ago by krushnach80

Yes, I convinced that journal to publish that for free... I published it from the English health system, which typically does not allocate money for research.

written 5 months ago by Kevin Blighe
Giovanni M Dall'Olio (London, UK) wrote, 23 days ago:

(The following is a dramatization of reality, but it is generally true. You can think of it as the legend of PCA.)

For many centuries, the scientific community believed what is written in the Bible: that all humans are created equal and there are no differences between people.

During the Enlightenment and into the 19th century, thanks also to Darwin's book and other advances, people started thinking that maybe that was not the case. My brother is taller than me! Andrew's cranium is bigger than Louis'! Justin is mean! And so on.

Scientists started compiling lists of observations about human variability - e.g. cranium size, weight, height, and many others - and this is how sciences like phrenology were born (http://www.victorianweb.org/science/phrenology/intro.html).

So, imagine these scientists with big tables of observations, like this:

subject  height  weight  cranium_size  eyebrow_type  ...
1        170     69      40            single
2        175     89      43            double
...

However, if you think about it, working with this data presents some challenges.

First of all, there are many variables, and so many dimensions. How can you even plot these? You could create a scatterplot for each combination of features, but you would end up with hundreds of plots. I am sure you have been in a similar situation even if you are not a phrenologist: whenever you had to analyse a dataset with a high number of variables.

The second problem is that some of these features are correlated with each other. For example, height and weight may be correlated, as taller people tend to weigh more. Many physical traits tend to be correlated. When your data contains correlated columns, you cannot really do regression or other types of analyses, because you are accounting for the same information twice.

So, PCA is a technique that was developed to solve these issues. In a PCA, you take a dataset with a high number of variables and you reduce it to two, or some other small number, of new variables (more precisely, these are called components). Each of these new components contains information from all of the original variables, but "flattened" in such a way that each component represents a portion of the variability in the original values. Moreover, these components are uncorrelated with each other, which allows you to apply regression and other techniques.

For each of these components, you will have a vector of "loadings", which tells you how much each of the original variables is represented in the component. For example, the first component may have high loadings for height and weight, two partially correlated variables that tend to contribute to variability in a similar way. You may even try to give a meaning to the new components, e.g. you could say that this first component is the "biggerness" (a composite of height and weight) of a person. However, it is not always easy to understand the real meaning of a component.

I'll leave it to other people to point out the mathematics of how to calculate a PCA, because there are plenty of articles out there; but in principle, this is what you are doing: you are reducing a dataset with several dimensions to a smaller number of components.

EDIT - I forgot to explain how this relates to PCA on RNA-seq data.

In an RNA-seq experiment, you can do a PCA on the expression levels of the genes. Instead of having a table with subject id, height, weight, and so on, you will have a table where each column is the expression of a gene and each row represents a sample. You may expect some genes to have correlated expression; in that case, these genes will probably have similar loadings in the same component.
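
A toy example of that last point (entirely made-up data) - two correlated 'genes' end up with similar loadings on the same component:

set.seed(42)
geneA <- rnorm(30, mean=10)
geneB <- geneA + rnorm(30, sd=0.3)   # strongly correlated with geneA
geneC <- rnorm(30, mean=5)           # independent of the others

expr <- cbind(geneA, geneB, geneC)   # rows = samples, columns = genes
pca <- prcomp(expr, center=TRUE, scale.=FALSE)

# geneA and geneB carry similar loadings on PC1; geneC loads mainly on another PC
pca$rotation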

I like this paper: https://www.nature.com/articles/ng.3173/ , where they applied PCA to a large number of expression datasets from public sources. They obtained a few hundred principal components for each dataset (e.g. 777 components for the large human dataset). Each of these components has a loading for each of the genes in the original data. They took these vectors of loadings and ran a pathway enrichment on each. The purpose of the enrichment is to try to describe what each of these components means; for example, the third component represents the status of gene expression in the brain, because many genes known to be in pathways related to brain development have high loadings in that component.


modified 23 days ago • written 23 days ago by Giovanni M Dall'Olio

Very good - especially the part on the component loadings, on which I had not focused in my answer.

written 23 days ago by Kevin Blighe

Thanks to both of you guys for the effort, very helpful indeed!

written 23 days ago by ATpoint
...and here is the proof of the non-correlated nature of the PCs:

[Correlation matrix of the PCs]

The PCs have absolutely no correlation to each other, i.e. Pearson r = 0.

written 21 days ago by Kevin Blighe
The thread is getting better and better in the context of PCA.

written 20 days ago by krushnach80