Question: PCA in an RNA-seq analysis
Mozart wrote, 11 months ago:

Hi everyone, I am trying to understand, a bit more deeply, what a PCA plot actually means in practical terms. So please don't link me to any kind of explanation... I am just trying to make this clear to myself as a personal exercise.

rna-seq pca • 4.1k views
modified 7 months ago • written 11 months ago by Mozart

Could you rephrase your question, please? It is not clear what the aim of your question is.

written 11 months ago by Nicolas Rosewick

Samples are not segregating according to the defined condition (A/B). MF1_S1 seems to be very different from whatever is going on with the rest of the bunch.

written 11 months ago by genomax
Kevin Blighe (Republic of Ireland) wrote, 11 months ago:

Apologies, I couldn't resist as PCA is a strong interest of mine.

Your terminology is incorrect. That is not a 'PCA plot' - it's a bi-plot comparing the sample scores along eigenvector (PC) 1 versus those along eigenvector (PC) 2. There will be more eigenvectors in your dataset, which you should also examine.

Intro to PCA

PCA is a very powerful technique and can have much utility if you can get a grasp of it and what it means. It was initially developed to analyse large volumes of data in order to tease out the differences/relationships between the logical entities being analysed (for example, a data-set consisting of a large number of samples, each with their own data points/variables). It extracts the fundamental structure of the data without the need to build any model to represent it. This ‘summary’ of the data is arrived at through a process of reduction that transforms the large number of variables into a lesser number that are uncorrelated (i.e. the ‘principal components'), whilst at the same time being capable of easy interpretation on the original data.

[Source: my own crummy manuscript]

The formulae and variance

The formulae used to derive the eigenvectors and their associated eigenvalues are fundamentally based on variance. Thus, what PCA is summarising in your dataset is variance. The eigenvectors are then ordered based on how much variance each explains in your dataset; PC1 / eigenvector 1 will always explain the most variance, due to this ordering of the PCs. Thus, as genomax has correctly pointed out in his comment above, the largest source of variation in your dataset is between MF1_S1 and the other samples, for reasons that we cannot know from here.

The variance explained by each eigenvector/PC is represented by a scree plot. The variances of all PCs will sum to 100% - PCA will extract every ounce of variation that exists in your dataset.
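As a minimal sketch (on random data, with made-up dimensions), the variance explained per PC can be computed and drawn directly from prcomp() output:

```r
# Minimal sketch on random data: variance explained per PC (a scree plot)
set.seed(1)
mat <- matrix(rnorm(200), nrow = 10)              # 10 samples x 20 "genes"
pca <- prcomp(mat, center = TRUE, scale. = FALSE)

prop_var <- pca$sdev^2 / sum(pca$sdev^2)          # proportions sum to 1 (100%)
barplot(100 * prop_var,
        names.arg = paste0("PC", seq_along(prop_var)),
        ylab = "Variance explained (%)", main = "Scree plot")
```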


What does this mean practically?

Practically, if you look at the derived loadings for your PC1, you'll first notice that each gene/transcript/variable has been assigned a value: a weighting that allows us to infer the gene's importance in relation to PC1, and, thus, its importance in relation to the source of variation between MF1_S1 and your other samples.

In your example, it looks like you're using DESeq2's in-built function to build the bi-plot. Don't use that. Instead, use the prcomp() function in R and then take a look at the results: the sample scores along each eigenvector are stored in a variable called 'x', e.g., prcomp(MyData)$x, and the per-gene loadings in 'rotation', e.g., prcomp(MyData)$rotation.
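For illustration (random data, made-up sample and gene names), the sample scores sit in $x and the per-gene weightings in $rotation:

```r
# Sketch: sample scores vs per-gene loadings from prcomp()
set.seed(1)
mat <- matrix(rnorm(200), nrow = 10,
              dimnames = list(paste0("sample", 1:10), paste0("gene", 1:20)))
pca <- prcomp(mat, center = TRUE)

head(pca$x[, 1:2])                                # scores: one row per sample
pc1 <- sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE)
head(pc1)                                         # genes weighted most heavily on PC1
```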

PCA is multi-dimensional

Remember that PCA is much more than just this bi-plot that you've posted. PCA is multidimensional and, as mentioned, will extract every ounce of variation that exists in your dataset, which can be visualised by pairwise comparisons of each PC...
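A quick way to see those pairwise comparisons (again a sketch on random data):

```r
# Sketch: pairwise scatter of the first few PCs
set.seed(1)
mat <- matrix(rnorm(200), nrow = 10)
pca <- prcomp(mat, center = TRUE)
pairs(pca$x[, 1:4], main = "Pairwise PC comparisons")
```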



The 'key' that you may be seeking could be hidden in these other PCs, but this depends greatly on your experimental set-up and what you are ultimately hoping to achieve by running whatever experiment it is that you're running.

That's PCA explained to the general audience.

modified 24 days ago by RamRS • written 11 months ago by Kevin Blighe

Finally, I found someone pointing out that we do not plot a proper PCA for RNA-seq data: mathematically it is multi-dimensional, and what we draw is a bi-plot showing just the components that capture the highest variability, based on eigenvalues.

modified 11 months ago • written 11 months ago by vchris_ngs

Thanks for the comment, chris! - grazie mille ('a thousand thanks')

written 11 months ago by Kevin Blighe

Hi Kevin,

Just a question about the PCA,

So, I have estimated my counts with Kallisto and then transformed them to "scaledTPM" values, which are obtained by summing the transcript-level TPMs by gene and multiplying by the total library size in millions.

(See this for more information Soneson C, Love M and Robinson M. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.)

I have then used the package edgeR to normalise these "scaledTPM" values (TMM + cpm):

y <- DGEList(scaledTPM) 
y <- calcNormFactors(y)
my_data <- cpm(y, log=T)

Now, for the PCA, should I use the scale option or not?

pca <- prcomp(t( my_data), scale = T)


pca <- prcomp(t( my_data), scale = F)
modified 11 months ago • written 11 months ago by Picasa

Considering that you've logged the counts (?), too, my preference would be to leave prcomp at the default settings:

prcomp(t(filteredExpr), center=TRUE, scale=FALSE)

If you were plotting just normalised, unlogged counts, then scaling would be preferable!
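The difference the scale argument makes can be checked on toy data; with scaling, every variable is forced to unit variance before the decomposition:

```r
# Sketch: effect of scaling in prcomp (toy data)
set.seed(1)
m <- matrix(rnorm(100), nrow = 10)                # 10 x 10 toy matrix
p_unscaled <- prcomp(m, center = TRUE, scale. = FALSE)
p_scaled   <- prcomp(m, center = TRUE, scale. = TRUE)

sum(p_unscaled$sdev^2)  # total variance of the raw columns
sum(p_scaled$sdev^2)    # equals ncol(m): each column was scaled to unit variance
```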

Edit: I have another thread here with more code: A: PCA plot from read count matrix from RNA-Seq

ADD REPLYlink modified 11 months ago • written 11 months ago by Kevin Blighe30k

From what I understand, cpm() function returns counts per million reads.

However, because I have used scaledTPMs instead of real counts, this is equivalent to the initial TPMs used to create the scaledTPMs - but these are additionally normalised by dividing each sample's values by the TMM normalisation factor.

written 11 months ago by Picasa

Just to clarify: TPM should not be used for any statistical analysis for DE. Rather, you can use it, log-transformed, for visualisation purposes in your PCA. Your TPM data are already normalised, and you do not have to re-calculate CPM from them. I would still use the counts data, normalise them with TMM, and plot the log-transformed CPM values for outlier detection, or for understanding how my samples are distributed in orthogonal space if I make a bi-plot of the first two components. TPM values should not be pushed into any analysis later on in the model.

written 11 months ago by vchris_ngs

Gets my upvote vchris - thanks

modified 11 months ago • written 11 months ago by Kevin Blighe

OK, but is using the counts not a problem? I have Kallisto counts, which are not integers.

written 11 months ago by Picasa

If you have used Kallisto, then you also have the raw 'estimated' counts, which you can read in using manual scripts or, preferably, tximport.

modified 11 months ago • written 11 months ago by Kevin Blighe

Thanks both for your answer. So if I understand:

From kallisto results, I can do a PCA and DE analysis using these data:


1) Import the counts:

txi <- tximport(files, type = "kallisto", tx2gene = transcript2gene)
y <- DGEList(txi$counts)

2) TMM normalisation + log:

y <- calcNormFactors(y)
my_data <- cpm(y, log=T)

3) PCA (without scale):

pca <- prcomp(t(my_data), scale = F)
modified 11 months ago • written 11 months ago by Picasa

I don't see any link between txi and your other lines of code(?).

As vchris was saying, differential expression analysis is normally conducted on normalised counts, with plotting/visualisation performed on logged or scaled normalised counts.

I am only familiar with Kallisto and DESeq2. In DESeq2, I normally do it like this:

txi.working <- tximport(files, type="kallisto", txIn=TRUE, txOut=FALSE, tx2gene=TranscriptToGeneMap, countsFromAbundance="no")

txi.working$abundance <- txi.working$abundance[TranscriptFilterIndices,]
txi.working$counts <- txi.working$counts[TranscriptFilterIndices,]
txi.working$length <- txi.working$length[TranscriptFilterIndices,]

#Convert the data to DESeq format and specify the model
#(sampleInfo below is a placeholder for your sample metadata data-frame)
dds <- DESeqDataSetFromTximport(txi.working, colData = sampleInfo, design = ~ Batch + GENDER)

dds <- DESeq(dds)

#Normalise the data and output the normalised counts to disk
norm <- counts(dds, normalized=TRUE)

#Produce rlog-transformed and variance stabilised data, and conduct some diagnostics
rld <- rlogTransformation(dds, blind=TRUE)

Later, diff. expression is conducted using results() on the dds object, containing normalised counts. For plotting (and PCA), I use the regularised log counts in rld.
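Continuing from the code above, those last steps might look like the following (a sketch, not standalone: it needs the dds and rld objects built earlier, and the contrast level names are assumptions to substitute with your own):

```r
# Sketch, continuing from dds/rld above
res <- results(dds, contrast = c("GENDER", "Male", "Female"))  # hypothetical level names
plotPCA(rld, intgroup = "GENDER")                              # DESeq2's quick PCA bi-plot
```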

written 11 months ago by Kevin Blighe

Sorry, I forgot the txi object (I have edited it).

What is the normalization method in

norm <- counts(dds, normalized=TRUE)


modified 11 months ago • written 11 months ago by Picasa

DESeq2 uses a 'geometric' normalisation, which produces a negative binomial normalised count distribution. edgeR uses a different method.
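The per-sample size factors that this normalisation produces can be inspected (a sketch, continuing from the dds object above; not standalone):

```r
# Sketch (not standalone): inspect DESeq2's per-sample size factors
dds <- estimateSizeFactors(dds)       # geometric-mean / median-of-ratios based
sizeFactors(dds)                      # one factor per sample
head(counts(dds, normalized = TRUE))  # raw counts divided by these factors
```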

I happened to just be reading specifically about the DESeq2 method whilst correcting a PhD student's thesis.

written 11 months ago by Kevin Blighe

Looks okay for doing PCA

written 11 months ago by Kevin Blighe

In this paper you explained which components to retain - this is something I asked and you answered. Maybe I could have just seen this paper first.

written 3 months ago by krushnach80

Yes, I convinced that journal to publish it for free... I published it from within the English health system, which typically does not allocate money for research.

written 3 months ago by Kevin Blighe


Powered by Biostar version 2.3.0