Question: PCA plot from read count matrix from RNA-Seq
3
gravatar for rachel.kubik12
17 months ago by
rachel.kubik1260 wrote:

Hello!

I have RNA-Seq data with 32 samples. I used Salmon to generate a read count matrix. I want to generate a PCA plot to look at the relationship between my samples. How do I use the read count matrix to perform the PCA analysis?

Thanks!

rna-seq pca • 10k views
ADD COMMENTlink modified 13 months ago by a.archana0 • written 17 months ago by rachel.kubik1260
38
gravatar for Kevin Blighe
17 months ago by
Kevin Blighe41k
London, England
Kevin Blighe41k wrote:

NB - this is now a Bioconductor R package: https://github.com/kevinblighe/PCAtools

-------------------------

You should normalise your data prior to performing PCA. In the code below, you'll have to add plot legends yourself, and also colour vectors (passed to the 'col' parameter).

Then, assuming that you have transcripts as rows and samples as columns:

NB - in this code, the plots I've shown don't necessarily match the exact code, but the plot type is the same

[Edit: also take a look at my definition of PCA here: PCA in a RNA seq analysis]

Perform PCA / single value decomposition

project.pca <- prcomp(t(MyReadCountMatrix))
summary(project.pca)

#Determine the proportion of variance of each component
#Proportion of variance equals (PC stdev^2) / (sum all PCs stdev^2)
project.pca.proportionvariances <- ((project.pca$sdev^2) / (sum(project.pca$sdev^2)))*100

Scree plot

barplot(project.pca.proportionvariances, cex.names=1, xlab=paste("Principal component (PC), 1-", length(project.pca$sdev)), ylab="Proportion of variation (%)", main="Scree plot", ylim=c(0,100))

scree

Pairs plots

par(cex=1.0, cex.axis=0.8, cex.main=0.8)
pairs(project.pca$x[,1:5], col="black", main="Principal components analysis bi-plot\nPCs 1-5", pch=16)
pairs(project.pca$x[,6:10], col="black", main="Principal components analysis bi-plot\nPCs 6-10", pch=16)

pairs

Bi-plots

par(mar=c(4,4,4,4), mfrow=c(1,3), cex=1.0, cex.main=0.8, cex.axis=0.8)

#Plots scatter plot for PC 1 and 2
plot(project.pca$x, type="n", main="Principal components analysis bi-plot", xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%"))
points(project.pca$x, col="black", pch=16, cex=1)

#Plots scatter plot for PC 1 and 3
plot(project.pca$x[,1], project.pca$x[,3], type="n", main="Principal components analysis bi-plot", xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC3, ", round(project.pca.proportionvariances[3], 2), "%"))
points(project.pca$x[,1], project.pca$x[,3], col="black", pch=16, cex=1)

#Plots scatter plot for PC 2 and 3
plot(project.pca$x[,2], project.pca$x[,3], type="n", main="Principal components analysis bi-plot", xlab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%"), ylab=paste("PC3, ", round(project.pca.proportionvariances[3], 2), "%"))
points(project.pca$x[,2], project.pca$x[,3], col="black", pch=16, cex=1)

biplots

Tri-plot

require(scatterplot3d)
par(mar=c(4,4,4,4), cex=1.0, cex.main=0.8, cex.axis=0.8)

scatterplot3d(project.pca$x[,1:3], angle=-40, main="", color="black", pch=17, xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%"), zlab=paste("PC3, ", round(project.pca.proportionvariances[3], 2), "%"), grid=FALSE, box=FALSE)
source('http://www.sthda.com/sthda/RDoc/functions/addgrids3d.r')
addgrids3d(project.pca$x[,1:3], grid = c("xy", "xz", "yz"))

source('http://www.sthda.com/sthda/RDoc/functions/addgrids3d.r')
addgrids3d(project.pca$x[,1:3], grid = c("xy", "xz", "yz"))

1kg

ADD COMMENTlink modified 6 weeks ago • written 17 months ago by Kevin Blighe41k
1

The third one is very fancy :)

ADD REPLYlink written 17 months ago by ATpoint15k

Hi Kevin,

I also want to perform a PCA plot. I have a read-count matrix data. You mentioned that we need to normalize data prior to PCA plot. Could you please help me how to normalize the data?

ADD REPLYlink written 17 months ago by Mehmet460

Hi Mehmet, from where did you obtain the counts or how did you produce them?

ADD REPLYlink written 17 months ago by Kevin Blighe41k

Hi Kevin,

I guess we obtain reads counts from HtSeq, and performed DEG analyses using edgeR. I have 23 samples. Previously, I generated a PCA plot using read.count.matrix data following your codes above. I also would like to ask you; what is difference between generating a PCA plot from PC1 to PC5, and PC1 to PC10?

ADD REPLYlink written 17 months ago by Mehmet460
1

Hi Mehmet, I have not used EdgeR (I prefer DESeq2), but PCA should be performed on normally-distributed counts, so, your normalised logged counts.

There is no difference between the generation of the pairs plots for PC1-5 and PC6-10. It is just a matter of fitting all of the plots on the same page. Technically, you can generate a pairs plot for PC1-PC100, but you would require a very large plotting window.

ADD REPLYlink written 17 months ago by Kevin Blighe41k

Hi Kevin, Thank you very much.

Should I do anything in my FPKM data prior to PCA plot?

I have run analyses to generate PCA plot (pairs plots). I would like to ask you a few things: 1. What are the dots in boxes? 2.What are the numbers around each boxes?

I have 17 samples in my data, but I have 15 dots in boxes in the plot.

could you please help me about how to interpret results ?

Thank you

ADD REPLYlink written 17 months ago by Mehmet460

If you have used HTSeq followed by EdgeR, then you should have logged counts(?). How does the distribution appear when you run: hist(MyData) ?

Can you post an image of the pairs plot and provide the code that you are using to execute the PCA?

You can upload figures/images here, and then copy/psate the URL here in your response.

Kevin

ADD REPLYlink written 17 months ago by Kevin Blighe41k

Hi Kevin,

I have uploaded two figures.

my commands:

project.pca <- prcomp(t(MyReadCountMatrix))
project.pca.proportionvariances <- ((project.pca$sdev^2) / (sum(project.pca$sdev^2)))*100

barplot(project.pca.proportionvariances, cex.names=1, xlab=paste("Principal component (PC), 1-", length(project.pca$sdev)), ylab="Proportion of variation (%)", main="Scree plot", ylim=c(0,100))

par(cex=1.0, cex.axis=0.8, cex.main=0.8)    
pairs(project.pca$x[,1:5], col="black", main="Principal components analysis bi-plot\nPCs 1-5", pch=16)
ADD REPLYlink modified 8 months ago by RamRS21k • written 17 months ago by Mehmet460
1

You need to post the URL's for the figures in your post.

ADD REPLYlink written 17 months ago by genomax65k

Sorry,

Please have a look

plot1


plot2

ADD REPLYlink modified 8 months ago by RamRS21k • written 17 months ago by Mehmet460

Looks okay to me. In the second one, the title is incorrect though, as it's PC1-10. You could also make the dots smaller here, but it's not that important as these plots are mainly just for QC.

What's the variance explained on PC1 and 2?

Looks fine.

Edit: as you go further down along the PCs, you always find 'odd' plots like the one for PC10. Nothing to worry about, though. I'd worry if I saw that on PC1-3

ADD REPLYlink modified 17 months ago • written 17 months ago by Kevin Blighe41k

Thank you for help.

I do not know variance between PC1 and PC2. When I run commands, I did not get variance value. I have 17 PCs when I type below:

    Standard deviations:
 [1] 3.49034964 1.22304723 1.05526472 0.99765777 0.76043084 0.56623044 0.37994839
 [8] 0.24585524 0.23052947 0.17215343 0.08918422 0.07063718 0.06521970 0.05890600
[15] 0.04826593 0.04446027 0.03575958

How to interpret the graphs? and what is the meaning of the dots in the boxes? As I have 17 PCs, should I show 17 PCs in the plot?

In some tutorials I saw that people write column names instead of PC1, PC2 .....

Should I use column names also?

There are several type of PCA plots. Which should I use in my paper?

Like this one:

https://www.bioconductor.org/help/workflows/rnaseqGene/#pca-plot

Sorry for asking too many questions.

Thank you

ADD REPLYlink modified 17 months ago • written 17 months ago by Mehmet460
1

The PCA bi-plot at https://www.bioconductor.org/help/workflows/rnaseqGene/#pca-plot is generated using the plotPCA function in DESeq2. It is usually used when processing RNA-seq data. However, this function is biased because it removes a large proportion of genes of low variance from the dataset prior to performing PCA, which defeats the purpose of conducting the PCA in the first place. My methods perform a completely unbiased PCA.

  • If you follow my code, the percentage of overall variance explained by each principal component will be held in project.pca.proportionvariances
  • Each dot is a sample.
  • You do not have to plot all PCs from 1-17. Most people only plot PC1 versus PC2. Look at my code at the top of this page under the heading 'Bi-plots'
  • PCA is primarily used for QC purposes to show that there are no outliers in the dataset. Looking at your PCA plots, there are no outliers. That is all you need to say. It is possible to go a lot more advanced into PCA, but, I would not recommend it unless you fully understand the PCA process
ADD REPLYlink modified 17 months ago • written 17 months ago by Kevin Blighe41k

Hi Kevin sorry for the random question, but would a PCA plot be useful to see the expression of different genotypes to a treatment. I have 5 genotypes and would like to see how the expression of each pair interact in a plot. Thanks

ADD REPLYlink written 14 months ago by catagui30

Hello, yes, it would be interesting to see that. You have RNA-seq data for these samples?

ADD REPLYlink written 14 months ago by Kevin Blighe41k

Yes I do have 5 genotypes (3 samples per geno) and a control and treatment.

ADD REPLYlink written 14 months ago by catagui30

Just a quick additional question, I have normalized my counts using DESeq, and then log transformed (log2) my data, but do you do something about the zero's in the datasets when doing the log transformation? Since they will give a infinite negative value. Do you add something to those zero's? Also, do you do any removal of lowly expressed genes, nor not for this kind of analysis? Thanks!!

ADD REPLYlink written 7 months ago by m.laarman10

Oh, to manage those, you should use the regularised log transformation, the function for which is supplied with DESeq2: https://rdrr.io/bioc/DESeq2/man/rlog.html

To then access the r-logged counts, you use assay():

x <- rlog(dds)
assay(x)

PCA is then performed on these r-logged counts.

ADD REPLYlink written 7 months ago by Kevin Blighe41k

Hi Kevin, I'm sorry to be asking more questions, I'm totally new to this. I'm trying to use the code you sent me, but it's giving me errors:

"rlog() may take a long time with 50 or more samples, vst() is a much faster transformation Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are not integers"

I'm wondering if this is based on the way I normalized; I used a script from a colleague from DESeq (not DESeq2), using the following code:

conds <- factor(c(colnames(counts)))
cds <- newCountDataSet(counts, conds)
cds <- estimateSizeFactors(cds)

normalized_counts <- counts(cds, normalized=TRUE)

Do you have an example of code that could get me from raw counts to logged normalized counts to use in PCA?

If you need more info on my samples please let me know. Thanks!

ADD REPLYlink written 7 months ago by m.laarman10

Oh, that error is common. When the dataset is large, it is indeed better to run vst(), and then you access these again via the assay() function. These are perfectly fine in place of logged counts.

If you're starting from your own data matrix of raw counts (cts), then you just need:

library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
dds <- DESeq(dds)

normalized_counts <- counts(dds, normalized=TRUE)

Are you actually testing anything as in differential expression? Here, condition would be, like, the CaseControl variable.

Then, for vst():

vsd <- vst(dds, blind=TRUE)
assay(vsd)

Note that DESeq2 already has its own PCA implementation: https://bioconductor.org/packages/3.7/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#principal-component-plot-of-the-samples

If you have issues running DESeq2 itself, you may consider opening a new question

ADD REPLYlink modified 7 months ago • written 7 months ago by Kevin Blighe41k

Thanks! I'm going to try around with DESeq2 and if I need more help I will open a new question.

Just one more question for now: do you know the MDS plot? I've used it during my analysis in edgeR, and do you know how this compares to a PCA plot? I've understood that also here the distance between the dots in the plot is a measure of distance between the datasets (Euclidean distance) (explanation in R: This function is a variation on the usual multdimensional scaling (or principle coordinate) plot, in that a distance measure particularly appropriate for the microarray context is used. The distance between each pair of samples (columns) is the root-mean-square deviation (Euclidean distance) for the top top genes. Distances on the plot can be interpreted as leading log2-fold-change, meaning the typical (root-mean-square) log2-fold-change between the samples for the genes that distinguish those samples.)

Would you know how this would compare to PCA?

ADD REPLYlink written 7 months ago by m.laarman10

Sorry for another reply. I've run the code you gave me and it worked, but I still have the same 'problem' now, in that there is no normal distribution in my data, since the large number of genes that had 0 counts before, now just shifted to around 4 (after normalization and log transformation). Therefore, a histogram of all genes of one of my samples looks like this:

histogram distribution of normalized logged counts from sample 1

I think the data except for the large bar of unexpressed genes might have a distribution more similar to a normal distribution, but how to deal with this?

ADD REPLYlink modified 7 months ago • written 7 months ago by m.laarman10

Hey, that does not look like logged or variance-stabilised data. Can you show the code that you used? Regarding MDS and PCA, the mathematical formula behind each is different, but they both ultimately show relationships between samples / variables. Note that MDS can be performed using the PCA transformation.

You may consider a new question for that, as my time is extremely limited today.

ADD REPLYlink written 7 months ago by Kevin Blighe41k

Hi Kevin,

Thanks! That doesn't sound good.. I've started a new post here: How to normalize count data for PCA in R - something goes wrong in case you would like to comment which I would greatly appreciate. I've posted the code there as well.

Thanks for your help so far!

ADD REPLYlink written 7 months ago by m.laarman10

Hi Kevin, I am a new one. Would you please explain how to set legends with color in biplot? My data: each column is a sample, each row is a gene. Thank you very much.

ADD REPLYlink written 8 months ago by linjc.xmu10

Hello, you can add a custom legend to any plo using something like:

legend("bottomright", bty="n", cex=0.8, title="MyFirstLegend", c("Chinese", "Iraqis", "Iranians", "Syrians", "Russians"), fill=c("red1", "black", "forestgreen", "red2", "darkred"))

The first parameter can be any of "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right", or "center"

Kevin

ADD REPLYlink written 8 months ago by Kevin Blighe41k
1

Thank you very much!

ADD REPLYlink written 8 months ago by linjc.xmu10

How can I put the names of the samples in this case? And different colors for it group? Because in the pipeline there are just the basic information to plot a simple graph.

Tks

ADD REPLYlink written 8 months ago by silas00880
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1341 users visited in the last hour