Question: PCA tpm fpkm
0
gravatar for riccardo
23 months ago by
riccardo40
riccardo40 wrote:

Hi all, I have expression values in TPM and FPKM from RSEM. Is it correct to perform a PCA direclty on these values or I have to manipulate these values before to do the PCA? Thank you.

Riccardo

tpm fpkm pca • 2.8k views
ADD COMMENTlink modified 23 months ago by Ar770 • written 23 months ago by riccardo40

Hello riccardo.panero!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?p=198521

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink written 23 months ago by WouterDeCoster31k
2
gravatar for vchris_ngs
23 months ago by
vchris_ngs4.5k
Seattle,WA, USA
vchris_ngs4.5k wrote:

Take a look at this link and here

It largely boils down to what your intention is. For looking at hierarchy of samples based on expression values here in your case either TPM or FPKM you can do the PCA on them, this gives you a clear visualisation of how your samples are organised in orthogonal space based on the first 2 PCs which ideally should be able to capture most of the variability.

For sample PCA it really does not matter for gene length normalisation if your libraries have similar depth , if not then you can normalise them for depth and gene length obtain the FPKM/TPM and summarise to matrix with gene/transcript name in rows and corresponding fpkm/tpms in columns for samples and make PCA.

But you want to see how genes are organised on the space then definitely gene length normalisation is needed

Take a look here

Since in lab we always do not have always all the samples run at the same time and the coverage might not be similar at all times so it might have some batches. In that case you might take the read counts , normalise them with cpm and do a pca to see if batch effects are visible or not and then use the variables that might contribute to the effect and correct it and then visualise the pca again.

FPKM/TPM PCA is also done but they are just for visualisation purpose but for more downstream requirement like differential expression analysis, estimating batch effects and or removal then try to use read count metrics, normalise them with DESeq2/limma/edgeR, make PCA , try to see the effect , if there is then correct for the effects on the read counts and then again normalise that and replot the PCA again. This kind of work is well documented with limma(using combat/sva from svaseq). I hope this makes sense now.

ADD COMMENTlink modified 23 months ago • written 23 months ago by vchris_ngs4.5k

Thanks for your answer. It was very clear. Normally I use DESeq2 and I can perform a PCA with rlog or VST and are very useful. Now I have used RSEM and I have TPM and FPKM and I cannot use the DESeq2 PCA because I have not integer value. If I used log2(FPKM) or log2(TPM) could I see, correctly, batch effects and or biological differences between samples? Thank you.

Riccardo

ADD REPLYlink written 23 months ago by riccardo40

I would encourage you to put this not as an answer but as a comment or reply to my answer to align by the norms of biostars.

Coming to your query , as I said , it depends on your biological question. If you are just trying to see how your samples are organised based on the fpkm or tpm then PCA will be able to show the difference in biological conditions/samples and also certain extent of batch effect. But correcting for batch effect should not be done on FPKM/TPM data. They should be done on count data which are not normalised and then normalise the corrected data and visualise with PCA to see to what extent this data was corrected for batch effects.

ADD REPLYlink written 23 months ago by vchris_ngs4.5k

Hello vchris, Thank you for your discussions! Seeing many threads about the normalization/preprocessing, I want to put some code pieces here, and look forward to your comments and help. Comparing to DEG analysis, I care more about data preparation.

library(Rsubread)
library(edgeR)

fc <- featureCounts(...)
x <- DGEList(counts = fc$counts, genes = fc$annotation[, c("GeneID", "Length")])

# filter
keep <- rowSums(cpm(x) > 1) >= 3
x.keep <- x[keep, , keep.lib.sizes = FALSE]

# TMM normalization
x_tmm <- calcNormFactors(x.keep)

# get normalized count
log_cpm_tmm <- cpm(x_tmm, log=T, prior.count=1)

# some exploratory work on samples
## PCA
pca <- prcomp(t(log_cpm_tmm))
## hierarchical clustering, MDS plot, or some thing else ...

What make me confused is: according to you, if batch effect is observed, I should correct it on the raw count data (fc$counts in this case), is that right? Then the corrected data will be float. How to re-build a DGEList object and re-normalize it?

See discussions of here: Using ComBat on RNASeq FPKM counts, log2(FPKM+1) was used for ComBat correction.

I'm starting to analyze RNA-seq data and want to make it right. Sorry if there are any oversight.

Thank you!

ADD REPLYlink modified 5 months ago • written 5 months ago by niuyw20
2
gravatar for Ar
23 months ago by
Ar770
United States
Ar770 wrote:

I would recommend to do log-transformation of the TPM/RSEM dataset. PCA has a hidden assumption of normality. PCA finds the coordinate system such that it can maximize the variance between the points. This achieved using orthogonal principal components. In case of multivariate Gaussian distribution (for example: microarray dataset), orthogonal components implies that there is zero correlation between the components. However, it is not true for dataset with poission or negative binomial distributions (like RNA-Seq counts, tpm, rpkm). Also, RNA-Seq datasets are very skewed and since, PCA is very sensitive to outliers, it is not recommended to do PCA on these datasets. Instead, do a log transformation and then plot PCA. If you are not interested in doing log transformation, then use cmdscale function for MDS plots.

Update: Code for PCA plots Suppose dat is your RPKM/TPM dataset. Make a genotype and/or condition vector.

genotype = c("KO1", "KO1", "WT1", "WT1","KO1", "WT1")
logTransformed.dat = log2(dat+ 1)
pcs = prcomp(t(logTransformed.dat), center = TRUE)
percentVar = round(((pcs$sdev) ^ 2 / sum((pcs$sdev) ^ 2)* 100), 2) 

## PCA Plot
ggplot(as.data.frame(pcs$x), aes(PC1,PC2), environment = environment()) + 
      xlab(makeLab(percentVar[1],1)) + ylab(makeLab(percentVar[2],2)) + ggtitle(title) + 
      geom_point(size = 8, aes(colour = genotypes)) +  
      theme(legend.text = element_text(size = 16, face = "bold"),
            legend.title = element_text(size = 16, colour = "black", face = "bold"),
            plot.title = element_text(size = 0, face ="bold"),
            axis.title = element_text(size = 18, face = "bold"),
            axis.text.x = element_text(size = 16, face = "bold", color = "black"),
            axis.text.y = element_text(size = 16, face = "bold", color = "black"),
            plot.margin = unit(c(0.5,0.5,0.5,0.5), "cm"))

For Batch Effects, check if the samples are clustering together or not or is it clustering based on batches (if the batches is known). Check what does principal component 1 and 2 tells you about the dataset.

ADD COMMENTlink modified 23 months ago • written 23 months ago by Ar770
1

prcomp needs a matrix with genes as columns right ?

ADD REPLYlink written 19 months ago by Picasa350

you need to define "makeLab"

ADD REPLYlink written 4 months ago by james.e.fifer0
1
gravatar for WouterDeCoster
23 months ago by
Belgium
WouterDeCoster31k wrote:

That sounds valid to me.

ADD COMMENTlink written 23 months ago by WouterDeCoster31k
1
gravatar for riccardo
23 months ago by
riccardo40
riccardo40 wrote:

Thanks for your answer. It was very clear. Normally I use DESeq2 and I can perform a PCA with rlog or VST and are very useful. Now I have used RSEM and I have TPM and FPKM and I cannot use the DESeq2 PCA because I have not integer value. If I used log2(FPKM) or log2(TPM) could I see, correctly, batch effects and or biological differences between samples? Thank you.

Riccardo

ADD COMMENTlink written 23 months ago by riccardo40
1

My opinion is somewhat different from using TPM (or FPKM or RPMK) for PCA. As I know, these values (TPM, FPKM, RPMK) are kind of frequency of read counts. Frequency may have confounding effect on real expression level of genes.

You may have the same level in frequency, but they could be different level in real expression, because frequency is calculated depending on the level of other gene expression like individual gene counts/total gene counts (and scale up (down?) to M in TPM, FPKM, and RPKM).

Also, I guess you scale (or standardize) the values before you perform PCA. If so, I don't know if it's right to scale the values of frequencies that are already scaled by depth.

So, I guess it would be better to use read counts after normalizing them with proper method, like vchris_ngs said above, or to use EC (expected counts) normalized by lenght of genes if you did RSEM (then you must have EC values).

It could be just my opinion, or maybe I be wrong. Please correct my thought, if you find it's wrong. Thanks,

ADD REPLYlink modified 14 months ago • written 14 months ago by mhyunjunkang20

Please used ADD COMMENT to answer to an earlier post, as such threads remain logically structured and easy to follow.

ADD REPLYlink written 23 months ago by WouterDeCoster31k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 720 users visited in the last hour