Question: How to perform PCA on gene expression?
gravatar for Thomas B.
13 months ago by
Thomas B.10
Thomas B.10 wrote:


I am interested in the identification of marker genes of infection by a parasite. I have several biological samples corresponding to different infection stages. I quantified the expression of genes that are though to be differentially expressed by qPCR. Now I would like to run a principal component analysis (PCA) to (1) cluster the samples based on gene expression and (2) identify which of the genes contribute the most to clustering.

From the qPCR experiment, I got data under three forms: Cq values, relative quantities and normalized expression against reference genes.

I read tutorials but most of them are focused on RNAseq data analysis, not qPCR. Some of my questions remain unanswered and I really hope to get some more explanations here.

  • Which data is it better to use for the PCA? Intuitively I would use normalized expression because the variance between samples has been taken into account. But I can see, for example in the HTqPCR package, that Cq values are rather plotted.

  • Should I apply a log transformation such as log(N+1)? My data have a logarithmic distribution. I read that normality is not an assumption of PCA; but the closer the data are to a normal distribution, the better the PCA performs.

  • Should I scale the data? The prcomp function in R (for example) offers this possibility. But I guess this is related to the type of data that is used. My feeling is that this is useless providing I use normalised gene expression, for the reason I provided in my first question.

I tried several combinations and they did not yield in the same output. I would like to find the proper, objective way - I don't want to just pick the one that suits best to my expectations.

ADD COMMENTlink modified 13 months ago by ahmad mousavi450 • written 13 months ago by Thomas B.10
gravatar for Kevin Blighe
13 months ago by
Kevin Blighe52k
Kevin Blighe52k wrote:

My preference for input to PCA is definitely normalised data that follows a Gaussian / binomial distribution. In RNA-seq, normalised data follows a negative binomial distribution, so, it 'requires' a transformation to logged (e.g. logCPM (EdgeR) or regularised log (DESeq2)) data prior to running PCA. The additional scaling step can still be used, as it helps to 'iron out' any extra creases in the data.

As you have qPCR data, you should check its distribution and then make a decision. Logging it is certainly feasible. If you log it and additionally convert it to Z-scores outside of PCA, then you should switch off the further scaling step in prcomp().

For checking the the relationship of genes to each PC, you need to look at the rotation object that is returned by prcomp(), e.g.,

pca <- prcomp(t(x))


ADD COMMENTlink written 13 months ago by Kevin Blighe52k
gravatar for ahmad mousavi
13 months ago by
ahmad mousavi450
Royan Institute, Tehran, Iran
ahmad mousavi450 wrote:

Hi you could try following on your raw reads count file:

# gr is group name of your samples based on columns in counts matrix.
#pcr<- data.frame(pc$r[,1:2])
pcr <- data.frame(pc$r[,1:2], Group=gr)
ggplot(pcr, aes(PC1, PC2, color=Group))+geom_point(size=7)+theme_bw()
ADD COMMENTlink written 13 months ago by ahmad mousavi450
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 785 users visited in the last hour