**40**wrote:

Hi everyone, I am trying to understand a bit deeper what actually a PCA plot means in a more practical way. So please don't link me any kind of explanation...I am just trying to make myself clear on a personal exercise.

Question: PCA in a RNA seq analysis

1

Mozart • **40** wrote:

Hi everyone, I am trying to understand a bit deeper what actually a PCA plot means in a more practical way. So please don't link me any kind of explanation...I am just trying to make myself clear on a personal exercise.

16

Kevin Blighe ♦ **17k** wrote:

Apologies, I couldn't resist as PCA is a strong interest of mine.

Your terminology is incorrect. That is not a '*PCA plot*' - it's a bi-plot comparing the eigenvalues of eigenvector (PC) 1 versus those of eigenvector (PC) 2. There will be more eigenvectors in your dataset at which you should additionally look.

PCA is a very powerful technique and can have much utility if you can get a grasp of it and what it means. It was initially developed to analyse large volumes of data in order to tease out the differences/relationships between the logical entities being analysed (for example, a data-set consisting of a large number of samples, each with their own data points/variables). **It extracts the fundamental structure of the data without the need to build any model to represent it.** This ‘summary’ of the data is arrived at through a process of reduction that transforms the large number of variables into a lesser number that are uncorrelated (i.e. the ‘principal components'), whilst at the same time being capable of easy interpretation on the original data.

*[Source: my own crumby manuscript: https://benthamopen.com/contents/pdf/TOBIOIJ/TOBIOIJ-7-19.pdf]*

The formulae to derive the eigenvectors and their associated eigenvalues are fundamentally based on variance. Thus, **what PCA is summarising in your dataset is variance**. The eigenvectors are then ordered based on how much variance they explain in your dataset. PC1 / eigenvector 1 will always explain the most variance due to this ordering of the PCs. Thus, as to which genomax has correctly pointed in his comment above, the largest source of variation in your dataset is between MF1_S1 and the other samples, for whatever reasons we are not to know.

The variance explained by each eigenvector/PC is represented by a Scree plot. **The variance of all PCs will sum to 100% - PCA will extract every ounce of variation that exists in your dataset**

Practically, if you look at the derived eigenvalues for your PC1, you'll first notice that each gene/transcript/variable has been assigned a value... a weighting that allows us to infer the gene's importance in relation to PC1, and, thus, its importance in relation to the source of variation between MF1_S1 and your other samples.

In your example, it looks like you're using DESeq2's in built function to build the bi-plot. Don't use that. Instead use the `prcomp()`

function in R and then take a look at your eigenvectors and eigenvalues, which will be stored in a variable called '*x*', e.g., `prcomp(MyData)$x`

Remember that PCA is much more than just this bi-plot that you've posted. PCA is multidimensional and, as mentioned, will extract every ounce of variation that exists in your dataset, which can be visualised by pairwise comparisons of each PC...

The 'key' that you may be seeking could be hidden in these other PCs, but this depends greatly on your experimental set-up and what you are ultimately hoping to achieve by running whatever experiment it is that you're running.

That's PCA explained to the general audience.

2

Finally I found someone that we do not plot a proper PCA in RNASeq data as we see, mathematically it is multi-dimensional and what we do is a bi-plot to just see the components that capture the highest variability based on eigen-values.

Hi Kevin,

Just a question about the PCA,

So I have estimated my count with Kallisto, then I have transformed it to "scaledTPM" values, which are obtained by summing the transcript-level TPMs by gene, and multiplying them with the total library size in millions.

(See this for more information Soneson C, Love M and Robinson M. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.)

I have then used the package `edgeR`

and normalize this "scaledTPM" values (TMM + cpm):

```
y <- DGEList(scaledTPM)
y <- calcNormFactors(y)
my_data <- cpm(y, log=T)
```

Now for the PCA should I use the option scale or no ?

```
pca <- prcomp(t( my_data), scale = T)
```

or

```
pca <- prcomp(t( my_data), scale = F)
```

Considering that you've logged the counts (?), too, my preference would be to leave prcomp at the default settings:

```
prcomp(t(filteredExpr), center=TRUE, scale=FALSE)
```

If you were plotting just normalised unlogged counts, then scaling would be preferential!

Edit: I have another thread here with more code: A: PCA plot from read count matrix from RNA-Seq

From what I understand, `cpm()`

function returns counts per million reads.

However, because I have used scaledTPMs instead of real counts, this is equivalent to the initial TPMs used to create the scaledTPMs. But these are additionally normalized by dividing each samples values by the TMM normalization factor

2

Just to clarify, TPM should not be used for any statistical analysis for DE. Rather you can use it in log-transformed for visualization purpose in your PCA. Your TPM is already a normalized data and you do not have to re-calculate the CPM from that. I would still use counts data, normalize it with TMM, plot the log-transformed CPM value for any outlier detection or understanding how my samples are distributed in orthogonal space if I make a bi-plot of 1st to components. TPM values should not be pushed into any analysis later on in the model.

Ok but using the counts is not a problem ? because I have kallisto counts (which are not integers).

If you have used Kallisto then you also have the raw 'estimated' counts, which you can read in using manual scripts or, preferably, tximport

Thanks both for your answer. So if I understand:

From kallisto results, I can do a PCA and DE analysis using these data:

```
library(edgeR)
```

1) Import the count:

```
txi <- tximport(files, type = "kallisto", tx2gene = transcript2gene)
y <- DGEList(txi$counts)
```

2) TMM normalisation + log:

```
y <- calcNormFactors(y)
my_data <- cpm(y, log=T)
```

3) PCA (without scale):

```
pca <- prcomp(t(my_data), scale = F)
```

I don't see any link between txi and your other lines of code(?).

As vchris was saying, differetial expression anaysis is normally conducted on normalised counts, with plotting/visualisation purposes performed on logged or scaled normalised counts.

In DESeq2, I normally do it like this:

I am only familiar with Kallisto and DESeq2:

```
txi <- tximport(files, type="kallisto", txIn=TRUE, txOut=FALSE, tx2gene=TranscriptToGeneMap, countsFromAbundance="no")
#Filtering
txi.working$abundance <- txi.working$abundance[TranscriptFilterIndices,]
txi.working$counts <- txi.working$counts[TranscriptFilterIndices,]
txi.working$length <- txi.working$length[TranscriptFilterIndices,]
#Convert the data to DESeq format and specify the model
dds <- DESeqDataSetFromTximport(txi.working, colData=metadata.final, design =~ Batch + GENDER)
dds <- DESeq(dds)
#Normalise the data and output the normalised counts to disk
norm <- counts(dds, normalized=TRUE)
#Produce rlog-transformed and variance stabilised data, and conduct some diagnostics
rld <- rlogTransformation(dds, blind=TRUE)
```

Later, diff. expression is conducted using `results()`

on the dds object, containing normalised counts. For plotting (and PCA), I use the regularised log counts in rld.

Sorry I forget the txi objet (I have edited it).

What is the normalization method in

```
norm <- counts(dds, normalized=TRUE)
```

?

1

DESeq2 uses a 'geometric' normalisation, which produces a negative binomial normalised count distrbution. EdgeR uses a different method.

I happened to just be reading specifically about the DESeq2 method whilst correcting a PhD student's thesis. https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-10-r106

Please log in to add an answer.

Use of this site constitutes acceptance of our User
Agreement
and Privacy
Policy.

Powered by Biostar
version 2.3.0

Traffic: 978 users visited in the last hour

Cold you rephrase your question please ? It is not clear what is the aim of your question..

6.2kSamples are not segregating according to the defined condition (A/B). MF1_S1 seems to be very different than whatever is going on with the rest of the bunch.

46khttps://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

17k