Question

PCA tpm fpkm

5

Entering edit mode

7.6 years ago

riccardo ▴ 90

Hi all, I have expression values in TPM and FPKM from RSEM. Is it correct to perform a PCA direclty on these values or I have to manipulate these values before to do the PCA? Thank you.

Riccardo

PCA TPM FPKM • 12k views

ADD COMMENT • link updated 7.6 years ago by Ar ★ 1.1k • written 7.6 years ago by riccardo ▴ 90

1

Entering edit mode

Thanks for your answer. It was very clear. Normally I use DESeq2 and I can perform a PCA with rlog or VST and are very useful. Now I have used RSEM and I have TPM and FPKM and I cannot use the DESeq2 PCA because I have not integer value. If I used log2(FPKM) or log2(TPM) could I see, correctly, batch effects and or biological differences between samples? Thank you.

Riccardo

ADD REPLY • link 7.6 years ago by riccardo ▴ 90

1

Entering edit mode

My opinion is somewhat different from using TPM (or FPKM or RPMK) for PCA. As I know, these values (TPM, FPKM, RPMK) are kind of frequency of read counts. Frequency may have confounding effect on real expression level of genes.

You may have the same level in frequency, but they could be different level in real expression, because frequency is calculated depending on the level of other gene expression like individual gene counts/total gene counts (and scale up (down?) to M in TPM, FPKM, and RPKM).

Also, I guess you scale (or standardize) the values before you perform PCA. If so, I don't know if it's right to scale the values of frequencies that are already scaled by depth.

So, I guess it would be better to use read counts after normalizing them with proper method, like vchris_ngs said above, or to use EC (expected counts) normalized by lenght of genes if you did RSEM (then you must have EC values).

It could be just my opinion, or maybe I be wrong. Please correct my thought, if you find it's wrong. Thanks,

ADD REPLY • link 6.9 years ago by mhyunjunkang ▴ 110

0

Entering edit mode

Please used ADD COMMENT to answer to an earlier post, as such threads remain logically structured and easy to follow.

ADD REPLY • link 7.6 years ago by WouterDeCoster 47k

0

Entering edit mode

Hello riccardo.panero!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?p=198521

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link 7.6 years ago by WouterDeCoster 47k

score 5 · Answer 1 · 2016-09-05

I would recommend to do log-transformation of the TPM/RSEM dataset. PCA has a hidden assumption of normality. PCA finds the coordinate system such that it can maximize the variance between the points. This achieved using orthogonal principal components. In case of multivariate Gaussian distribution (for example: microarray dataset), orthogonal components implies that there is zero correlation between the components. However, it is not true for dataset with poission or negative binomial distributions (like RNA-Seq counts, tpm, rpkm). Also, RNA-Seq datasets are very skewed and since, PCA is very sensitive to outliers, it is not recommended to do PCA on these datasets. Instead, do a log transformation and then plot PCA. If you are not interested in doing log transformation, then use cmdscale function for MDS plots.

Update: Code for PCA plots Suppose dat is your RPKM/TPM dataset. Make a genotype and/or condition vector.

genotype = c("KO1", "KO1", "WT1", "WT1","KO1", "WT1")
logTransformed.dat = log2(dat+ 1)
pcs = prcomp(t(logTransformed.dat), center = TRUE)
percentVar = round(((pcs$sdev) ^ 2 / sum((pcs$sdev) ^ 2)* 100), 2) 

## PCA Plot
ggplot(as.data.frame(pcs$x), aes(PC1,PC2), environment = environment()) + 
      xlab(makeLab(percentVar[1],1)) + ylab(makeLab(percentVar[2],2)) + ggtitle(title) + 
      geom_point(size = 8, aes(colour = genotypes)) +  
      theme(legend.text = element_text(size = 16, face = "bold"),
            legend.title = element_text(size = 16, colour = "black", face = "bold"),
            plot.title = element_text(size = 0, face ="bold"),
            axis.title = element_text(size = 18, face = "bold"),
            axis.text.x = element_text(size = 16, face = "bold", color = "black"),
            axis.text.y = element_text(size = 16, face = "bold", color = "black"),
            plot.margin = unit(c(0.5,0.5,0.5,0.5), "cm"))

For Batch Effects, check if the samples are clustering together or not or is it clustering based on batches (if the batches is known). Check what does principal component 1 and 2 tells you about the dataset.

score 4 · Answer 2 · 2016-09-05

4

Entering edit mode

7.6 years ago

ivivek_ngs ★ 5.2k

Take a look at this link and here

It largely boils down to what your intention is. For looking at hierarchy of samples based on expression values here in your case either TPM or FPKM you can do the PCA on them, this gives you a clear visualisation of how your samples are organised in orthogonal space based on the first 2 PCs which ideally should be able to capture most of the variability.

For sample PCA it really does not matter for gene length normalisation if your libraries have similar depth , if not then you can normalise them for depth and gene length obtain the FPKM/TPM and summarise to matrix with gene/transcript name in rows and corresponding fpkm/tpms in columns for samples and make PCA.

But you want to see how genes are organised on the space then definitely gene length normalisation is needed

Take a look here

Since in lab we always do not have always all the samples run at the same time and the coverage might not be similar at all times so it might have some batches. In that case you might take the read counts , normalise them with cpm and do a pca to see if batch effects are visible or not and then use the variables that might contribute to the effect and correct it and then visualise the pca again.

FPKM/TPM PCA is also done but they are just for visualisation purpose but for more downstream requirement like differential expression analysis, estimating batch effects and or removal then try to use read count metrics, normalise them with DESeq2/limma/edgeR, make PCA , try to see the effect , if there is then correct for the effects on the read counts and then again normalise that and replot the PCA again. This kind of work is well documented with limma(using combat/sva from svaseq). I hope this makes sense now.

ADD COMMENT • link 7.6 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

Thanks for your answer. It was very clear. Normally I use DESeq2 and I can perform a PCA with rlog or VST and are very useful. Now I have used RSEM and I have TPM and FPKM and I cannot use the DESeq2 PCA because I have not integer value. If I used log2(FPKM) or log2(TPM) could I see, correctly, batch effects and or biological differences between samples? Thank you.

Riccardo

ADD REPLY • link 7.6 years ago by riccardo ▴ 90

0

Entering edit mode

I would encourage you to put this not as an answer but as a comment or reply to my answer to align by the norms of biostars.

Coming to your query , as I said , it depends on your biological question. If you are just trying to see how your samples are organised based on the fpkm or tpm then PCA will be able to show the difference in biological conditions/samples and also certain extent of batch effect. But correcting for batch effect should not be done on FPKM/TPM data. They should be done on count data which are not normalised and then normalise the corrected data and visualise with PCA to see to what extent this data was corrected for batch effects.

ADD REPLY • link 7.6 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

Hello vchris, Thank you for your discussions! Seeing many threads about the normalization/preprocessing, I want to put some code pieces here, and look forward to your comments and help. Comparing to DEG analysis, I care more about data preparation.

library(Rsubread)
library(edgeR)

fc <- featureCounts(...)
x <- DGEList(counts = fc$counts, genes = fc$annotation[, c("GeneID", "Length")])

# filter
keep <- rowSums(cpm(x) > 1) >= 3
x.keep <- x[keep, , keep.lib.sizes = FALSE]

# TMM normalization
x_tmm <- calcNormFactors(x.keep)

# get normalized count
log_cpm_tmm <- cpm(x_tmm, log=T, prior.count=1)

# some exploratory work on samples
## PCA
pca <- prcomp(t(log_cpm_tmm))
## hierarchical clustering, MDS plot, or some thing else ...

What make me confused is: according to you, if batch effect is observed, I should correct it on the raw count data (fc$counts in this case), is that right? Then the corrected data will be float. How to re-build a DGEList object and re-normalize it?

See discussions of here: Using ComBat on RNASeq FPKM counts, log2(FPKM+1) was used for ComBat correction.

I'm starting to analyze RNA-seq data and want to make it right. Sorry if there are any oversight.

Thank you!

ADD REPLY • link 6.1 years ago by niuyw ▴ 40

0

Entering edit mode

Apologies for the late reply. There are many such Biostar examples floating.

In short, these are for visualization purposes. Correcting for batch on your FPKM/TPM and looking into PCA/MDS. This only guides us with our data-driven vs hypothesis driven goals. For DGElist or any differential analysis, batch correction can be done using available functions from edgeR or simply use the batch as a factor in the model design matrix to take care of it in a multifactorial design before running differential analysis using edgeR/DESeq2/limma-voom. What we feed is count matrix. A comprehensive tutorial can be found here for limma.

ADD REPLY • link 4.0 years ago by ivivek_ngs ★ 5.2k

score 1 · Answer 3 · 2016-09-05

1

Entering edit mode

7.6 years ago

WouterDeCoster 47k

That sounds valid to me.

ADD COMMENT • link 7.6 years ago by WouterDeCoster 47k