Question: PCA tpm fpkm
gravatar for riccardo
3.8 years ago by
riccardo80 wrote:

Hi all, I have expression values in TPM and FPKM from RSEM. Is it correct to perform a PCA direclty on these values or I have to manipulate these values before to do the PCA? Thank you.


tpm fpkm pca • 5.9k views
ADD COMMENTlink modified 3.8 years ago by Ar980 • written 3.8 years ago by riccardo80

Thanks for your answer. It was very clear. Normally I use DESeq2 and I can perform a PCA with rlog or VST and are very useful. Now I have used RSEM and I have TPM and FPKM and I cannot use the DESeq2 PCA because I have not integer value. If I used log2(FPKM) or log2(TPM) could I see, correctly, batch effects and or biological differences between samples? Thank you.


ADD REPLYlink written 3.8 years ago by riccardo80

My opinion is somewhat different from using TPM (or FPKM or RPMK) for PCA. As I know, these values (TPM, FPKM, RPMK) are kind of frequency of read counts. Frequency may have confounding effect on real expression level of genes.

You may have the same level in frequency, but they could be different level in real expression, because frequency is calculated depending on the level of other gene expression like individual gene counts/total gene counts (and scale up (down?) to M in TPM, FPKM, and RPKM).

Also, I guess you scale (or standardize) the values before you perform PCA. If so, I don't know if it's right to scale the values of frequencies that are already scaled by depth.

So, I guess it would be better to use read counts after normalizing them with proper method, like vchris_ngs said above, or to use EC (expected counts) normalized by lenght of genes if you did RSEM (then you must have EC values).

It could be just my opinion, or maybe I be wrong. Please correct my thought, if you find it's wrong. Thanks,

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by mhyunjunkang40

Please used ADD COMMENT to answer to an earlier post, as such threads remain logically structured and easy to follow.

ADD REPLYlink written 3.8 years ago by WouterDeCoster44k

Hello riccardo.panero!

It appears that your post has been cross-posted to another site:

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink written 3.8 years ago by WouterDeCoster44k
gravatar for Ar
3.8 years ago by
United States
Ar980 wrote:

I would recommend to do log-transformation of the TPM/RSEM dataset. PCA has a hidden assumption of normality. PCA finds the coordinate system such that it can maximize the variance between the points. This achieved using orthogonal principal components. In case of multivariate Gaussian distribution (for example: microarray dataset), orthogonal components implies that there is zero correlation between the components. However, it is not true for dataset with poission or negative binomial distributions (like RNA-Seq counts, tpm, rpkm). Also, RNA-Seq datasets are very skewed and since, PCA is very sensitive to outliers, it is not recommended to do PCA on these datasets. Instead, do a log transformation and then plot PCA. If you are not interested in doing log transformation, then use cmdscale function for MDS plots.

Update: Code for PCA plots Suppose dat is your RPKM/TPM dataset. Make a genotype and/or condition vector.

genotype = c("KO1", "KO1", "WT1", "WT1","KO1", "WT1")
logTransformed.dat = log2(dat+ 1)
pcs = prcomp(t(logTransformed.dat), center = TRUE)
percentVar = round(((pcs$sdev) ^ 2 / sum((pcs$sdev) ^ 2)* 100), 2) 

## PCA Plot
ggplot($x), aes(PC1,PC2), environment = environment()) + 
      xlab(makeLab(percentVar[1],1)) + ylab(makeLab(percentVar[2],2)) + ggtitle(title) + 
      geom_point(size = 8, aes(colour = genotypes)) +  
      theme(legend.text = element_text(size = 16, face = "bold"),
            legend.title = element_text(size = 16, colour = "black", face = "bold"),
            plot.title = element_text(size = 0, face ="bold"),
            axis.title = element_text(size = 18, face = "bold"),
            axis.text.x = element_text(size = 16, face = "bold", color = "black"),
            axis.text.y = element_text(size = 16, face = "bold", color = "black"),
            plot.margin = unit(c(0.5,0.5,0.5,0.5), "cm"))

For Batch Effects, check if the samples are clustering together or not or is it clustering based on batches (if the batches is known). Check what does principal component 1 and 2 tells you about the dataset.

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Ar980

prcomp needs a matrix with genes as columns right ?

ADD REPLYlink written 3.5 years ago by Picasa530

you need to define "makeLab"

ADD REPLYlink written 2.2 years ago by james.e.fifer0

I second that answer to being careful with PCA. We do it but often we do not try to understand the pros and cons of using it in array/RNASeq data. The underlying mathematics and assumptions need to be well thought for any dimension reduction methods. Again given data space is not big, one also need to rethink the use of PCA. Over the years, handling various different types of omics, I have shifted to MDS as it does not make such assumptions. There is also a large body of multivariate statistics method papers/material that often make compelling point of thinking through while choosing PCA or MDS.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by ivivek_ngs4.9k
gravatar for ivivek_ngs
3.8 years ago by
Seattle,WA, USA
ivivek_ngs4.9k wrote:

Take a look at this link and here

It largely boils down to what your intention is. For looking at hierarchy of samples based on expression values here in your case either TPM or FPKM you can do the PCA on them, this gives you a clear visualisation of how your samples are organised in orthogonal space based on the first 2 PCs which ideally should be able to capture most of the variability.

For sample PCA it really does not matter for gene length normalisation if your libraries have similar depth , if not then you can normalise them for depth and gene length obtain the FPKM/TPM and summarise to matrix with gene/transcript name in rows and corresponding fpkm/tpms in columns for samples and make PCA.

But you want to see how genes are organised on the space then definitely gene length normalisation is needed

Take a look here

Since in lab we always do not have always all the samples run at the same time and the coverage might not be similar at all times so it might have some batches. In that case you might take the read counts , normalise them with cpm and do a pca to see if batch effects are visible or not and then use the variables that might contribute to the effect and correct it and then visualise the pca again.

FPKM/TPM PCA is also done but they are just for visualisation purpose but for more downstream requirement like differential expression analysis, estimating batch effects and or removal then try to use read count metrics, normalise them with DESeq2/limma/edgeR, make PCA , try to see the effect , if there is then correct for the effects on the read counts and then again normalise that and replot the PCA again. This kind of work is well documented with limma(using combat/sva from svaseq). I hope this makes sense now.

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by ivivek_ngs4.9k

Thanks for your answer. It was very clear. Normally I use DESeq2 and I can perform a PCA with rlog or VST and are very useful. Now I have used RSEM and I have TPM and FPKM and I cannot use the DESeq2 PCA because I have not integer value. If I used log2(FPKM) or log2(TPM) could I see, correctly, batch effects and or biological differences between samples? Thank you.


ADD REPLYlink written 3.8 years ago by riccardo80

I would encourage you to put this not as an answer but as a comment or reply to my answer to align by the norms of biostars.

Coming to your query , as I said , it depends on your biological question. If you are just trying to see how your samples are organised based on the fpkm or tpm then PCA will be able to show the difference in biological conditions/samples and also certain extent of batch effect. But correcting for batch effect should not be done on FPKM/TPM data. They should be done on count data which are not normalised and then normalise the corrected data and visualise with PCA to see to what extent this data was corrected for batch effects.

ADD REPLYlink written 3.8 years ago by ivivek_ngs4.9k

Hello vchris, Thank you for your discussions! Seeing many threads about the normalization/preprocessing, I want to put some code pieces here, and look forward to your comments and help. Comparing to DEG analysis, I care more about data preparation.


fc <- featureCounts(...)
x <- DGEList(counts = fc$counts, genes = fc$annotation[, c("GeneID", "Length")])

# filter
keep <- rowSums(cpm(x) > 1) >= 3
x.keep <- x[keep, , keep.lib.sizes = FALSE]

# TMM normalization
x_tmm <- calcNormFactors(x.keep)

# get normalized count
log_cpm_tmm <- cpm(x_tmm, log=T, prior.count=1)

# some exploratory work on samples
## PCA
pca <- prcomp(t(log_cpm_tmm))
## hierarchical clustering, MDS plot, or some thing else ...

What make me confused is: according to you, if batch effect is observed, I should correct it on the raw count data (fc$counts in this case), is that right? Then the corrected data will be float. How to re-build a DGEList object and re-normalize it?

See discussions of here: Using ComBat on RNASeq FPKM counts, log2(FPKM+1) was used for ComBat correction.

I'm starting to analyze RNA-seq data and want to make it right. Sorry if there are any oversight.

Thank you!

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by niuyw30

Apologies for the late reply. There are many such Biostar examples floating.

In short, these are for visualization purposes. Correcting for batch on your FPKM/TPM and looking into PCA/MDS. This only guides us with our data-driven vs hypothesis driven goals. For DGElist or any differential analysis, batch correction can be done using available functions from edgeR or simply use the batch as a factor in the model design matrix to take care of it in a multifactorial design before running differential analysis using edgeR/DESeq2/limma-voom. What we feed is count matrix. A comprehensive tutorial can be found here for limma.

ADD REPLYlink written 9 weeks ago by ivivek_ngs4.9k
gravatar for WouterDeCoster
3.8 years ago by
WouterDeCoster44k wrote:

That sounds valid to me.

ADD COMMENTlink written 3.8 years ago by WouterDeCoster44k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1027 users visited in the last hour