How to create PCA plot with TPM data
2
0
Entering edit mode
3.3 years ago
Abigail • 0

Hi, all. I am new to RNA analysis. Now I have RNA-Seq data with 60 samples. Now I generate a TPM matrix. I want to create a PCA plot using the data to see the relationship between samples. Should I log-transformed my TPM data and filter the genes by variance as input? If so, what should I do to deal with the "-Inf" which are transformed from zero. Thanks a lot!

rna-seq • 3.1k views
ADD COMMENT
0
Entering edit mode

No need to perform log transformed to TPM data.

ADD REPLY
2
Entering edit mode
3.3 years ago

You don't need to log-transform the TPMs. Typically, the top 500 most variable genes are used. If you haven't done it yet, I recommend to read through DESeq2's vignette, it contains tons of information about RNA-seq processing and visualization. There's also the pcaExplorer package, which has great documentation and directly interfaces with DESeq2.

## Most pertinent details fromDESeq2::plotPCA()
# calculate the variance for each gene
  rv <- rowVars(assay(object))

  # select the ntop genes by variance
  select <- order(rv, decreasing=TRUE)[seq_len(min(ntop, length(rv)))]

  # perform a PCA on the data in assay(x) for the selected genes
  pca <- prcomp(t(assay(object)[select,]))

  # the contribution to the total variance for each component
  percentVar <- pca$sdev^2 / sum( pca$sdev^2 )
ADD COMMENT
0
Entering edit mode

Thanks for your suggestion. I did DESeq2 by the script in Trinity to generate the TPMs. When I use functions in R, trying to do PCA

unable to find an inherited method for function ‘XXX’ for signature ‘"data.frame"’

Similar errors were always reported. Do you know how to deal with it? Thanks again.

ADD REPLY
0
Entering edit mode

did DESeq2 by the script in Trinity to generate the TPMs

I don't know what that is, so I would need to see the entire code chunk in order to make a statement.

ADD REPLY
2
Entering edit mode
3.3 years ago
ATpoint 82k

I would strongly recommend to transform your data to log scale before running PCA. You commonly select genes for PCA based on the most variable genes (so variance). The thing is that there is a strong relationship between the mean and the variance if not working on the log scale. That means the higher the counts, the higher the variance, so selecting by variance without log-transformation will just give you genes with high expression level, but what you want is the variance driven by the differences between the samples, so PCA separation actually tells you about the differences between the samples without the confounding effect of the expression level.

Lets have an example with TPM data, tpm is a numeric matrix with TPMs in this case just two n=2 from two different cell types:

library(UpSetR)
library(matrixStats)

#/ Plot the mean vs the variance of this two-column tpm matrix with and without log2-ing the TPMs:
mean.nolog <- rowMeans2(tpm)
rv.nolog   <- rowVars(tpm)
mean.log   <- rowMeans2(log2(tpm+1))
rv.log     <- rowVars(log2(tpm+1))

#/ Select top-500 most variable genes besed on rowwise variance:
mostVar.nolog <- head(order(rv.nolog, decreasing = TRUE), 500)
mostVar.log   <- head(order(rv.log, decreasing = TRUE), 500)

#/ Pearson correlation between mean and variance:
cor.nolog <- round(cor(mean.nolog, rv.nolog), 2)
cor.log   <- round(cor(mean.log, rv.log), 2)

#/ Plot:
par(mfrow=c(2,1), bty="n")
#/ put "nolog" results on log scale for visualization (not the TPMs itself):
plot(log2(mean.nolog+1), log2(rv.nolog+1), pch=20, main = paste("Pearson:", cor.nolog))
plot(mean.log, rv.log, pch=20, main = paste("PearsonCor:", cor.log))
dev.off()

#/ Visualize the very limited overlap between the HVGs using upset:
print(upset(fromList(list(noLog = mostVar.nolog, Log = mostVar.log))))

See the below plots, the mean/variance dependency confounds the selection of genes by expression level, which is usually not what you want, therefore log2-transformation is desirable. This effect is called variance stablilization. There are dedicated approaches to do this, such as the vst function from DESeq2 when starting from raw gene level counts, but usually the log is a sufficient proxy for this and commonly used. As you'll see in the upset plot below the overlap between the top variable genes is very small between the two methods, I would use the ones based on the log-transformed counts.

From there you can go ahead with PCA, either using code inspired from the DESeq2 plotPCA function or check PCAtools at Bioconductor.

enter image description here

ADD COMMENT
1
Entering edit mode

These are great explanations.

I often want to get a feeling for the data without variance stabilization first, because vst does make a couple of strong assumptions. It is also informative to investigate why the vst and the TPM PCA differ.

ADD REPLY

Login before adding your answer.

Traffic: 2863 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6