I would recommend log-transforming the TPM/RSEM dataset. PCA has an implicit assumption of normality. It finds an orthogonal coordinate system (the principal components) in which each successive axis captures as much of the variance between the points as possible. Orthogonal components are uncorrelated, and for a multivariate Gaussian distribution (for example, a microarray dataset) zero correlation also means the components are independent. That is not true for data that follow a Poisson or negative binomial distribution (like RNA-Seq counts, TPM, RPKM). RNA-Seq datasets are also very skewed, and since PCA is very sensitive to outliers, it is not recommended to run PCA on the untransformed values. Instead, log-transform the data and then plot the PCA. If you do not want to log-transform, use the cmdscale function to make MDS plots.
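As a rough illustration of the cmdscale route, here is a minimal sketch in base R. The simulated matrix, the sample names, and the choice of 1 minus the Spearman correlation as the distance are all assumptions made for the example; the answer above only names cmdscale and does not say which distance to use.

```r
set.seed(3)

# Hypothetical skewed expression matrix: 500 genes (rows) x 6 samples (columns)
dat <- matrix(rexp(500 * 6, rate = 0.05), nrow = 500,
              dimnames = list(paste0("gene", 1:500), paste0("S", 1:6)))

# Assumed distance: 1 - Spearman correlation between samples.
# It is rank-based, so it does not need a log transformation.
d <- as.dist(1 - cor(dat, method = "spearman"))

# Classical MDS in two dimensions; one row of coordinates per sample
mds <- cmdscale(d, k = 2)

plot(mds[, 1], mds[, 2], pch = 19, xlab = "MDS 1", ylab = "MDS 2")
text(mds[, 1], mds[, 2], labels = rownames(mds), pos = 3)
```

With real data, dat would be your own TPM/RPKM matrix with samples in columns.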
Update: code for PCA plots. Suppose dat is your RPKM/TPM dataset, with genes in rows and samples in columns. Make a genotype and/or condition vector with one entry per sample, in column order.
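    library(ggplot2)

    # One label per sample (column of dat), in column order
    genotype = c("KO1", "KO1", "WT1", "WT1", "KO1", "WT1")

    # Pseudocount of 1 keeps zero values defined after log2
    logTransformed.dat = log2(dat + 1)

    # prcomp expects samples in rows, hence the transpose
    pcs = prcomp(t(logTransformed.dat), center = TRUE)
    percentVar = round(((pcs$sdev) ^ 2 / sum((pcs$sdev) ^ 2) * 100), 2)

    # makeLab and title were not defined in the original snippet;
    # these are simple stand-ins
    makeLab = function(pv, i) paste0("PC", i, ": ", pv[i], "% variance")
    title = "PCA of log2-transformed expression"

    # Keep the scores and the genotype labels in one data frame
    pca.df = data.frame(pcs$x, genotype = genotype)

    ## PCA Plot
    ggplot(pca.df, aes(PC1, PC2)) +
      xlab(makeLab(percentVar, 1)) +
      ylab(makeLab(percentVar, 2)) +
      ggtitle(title) +
      geom_point(size = 8, aes(colour = genotype)) +
      theme(legend.text = element_text(size = 16, face = "bold"),
            legend.title = element_text(size = 16, colour = "black", face = "bold"),
            plot.title = element_text(size = 18, face = "bold"),
            axis.title = element_text(size = 18, face = "bold"),
            axis.text.x = element_text(size = 16, face = "bold", color = "black"),
            axis.text.y = element_text(size = 16, face = "bold", color = "black"),
            plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"))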
For batch effects, check whether the samples cluster by genotype/condition or by batch (if the batches are known). Then look at what principal components 1 and 2 tell you about the dataset.
It largely boils down to what your intention is. To look at how your samples relate to each other based on expression values (in your case either TPM or FPKM), you can run the PCA on them. This gives you a clear visualisation of how the samples are organised in orthogonal space along the first 2 PCs, which ideally capture most of the variability.
For a sample PCA, gene length normalisation does not really matter if your libraries have similar depth. If they do not, normalise for depth and gene length to obtain FPKM/TPM, summarise the values into a matrix with gene/transcript names in rows and samples in columns, and make the PCA from that.
But if you want to see how the genes are organised in that space, then gene length normalisation is definitely needed.
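The depth and gene length normalisation mentioned above can be sketched in a few lines of base R. The count matrix, the gene lengths, and the helper name countsToTpm are made up for the example; in practice the lengths would come from your annotation or from the quantification tool.

```r
# Hypothetical raw counts: 4 genes (rows) x 3 samples (columns)
counts <- matrix(c(100, 200, 300,
                   400, 500, 600,
                    50,  60,  70,
                     0,  10,  20),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("S", 1:3)))

# Hypothetical gene lengths in base pairs, in the same order as the rows
gene.length <- c(gene1 = 1000, gene2 = 2000, gene3 = 500, gene4 = 1500)

# Illustrative helper (not from any package)
countsToTpm <- function(counts, lengths) {
  # Step 1: reads per kilobase (corrects for gene length, row by row)
  rate <- counts / (lengths / 1000)
  # Step 2: scale each sample so that its values sum to one million
  t(t(rate) / colSums(rate)) * 1e6
}

tpm <- countsToTpm(counts, gene.length)
colSums(tpm)
```

The result keeps genes in rows and samples in columns, which is the layout described above for the PCA input.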
Take a look here
In the lab, the samples are not always run at the same time and the coverage is not always similar, so the data may contain batches. In that case you can take the read counts, normalise them to CPM, and do a PCA to see whether batch effects are visible. Then identify the variables that might contribute to the effect, correct for them, and visualise the PCA again.
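A minimal base R sketch of that first check (counts to CPM, then a PCA coloured by batch) is shown here. The simulated counts and the batch labels are assumptions for illustration only; with real data you would use your own count matrix and the batches you know about.

```r
set.seed(1)

# Hypothetical raw count matrix: 200 genes (rows) x 6 samples (columns)
counts <- matrix(rnbinom(200 * 6, mu = 100, size = 2), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("S", 1:6)))

# Assumed known batch labels, one per sample
batch <- factor(c("A", "A", "A", "B", "B", "B"))

# Counts per million: divide each column by its library size
cpm <- t(t(counts) / colSums(counts)) * 1e6
logcpm <- log2(cpm + 1)

# PCA on samples (prcomp expects samples in rows)
pcs <- prcomp(t(logcpm), center = TRUE)
percentVar <- round(pcs$sdev^2 / sum(pcs$sdev^2) * 100, 2)

# Colour the points by batch to see whether samples separate by batch
plot(pcs$x[, 1], pcs$x[, 2], col = as.integer(batch), pch = 19,
     xlab = paste0("PC1 (", percentVar[1], "%)"),
     ylab = paste0("PC2 (", percentVar[2], "%)"))
legend("topright", legend = levels(batch), col = seq_along(levels(batch)), pch = 19)
```

If the points separate by batch along PC1 or PC2, that is a sign that batch is a major source of variation.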
PCA on FPKM/TPM is also done, but only for visualisation purposes. For downstream requirements like differential expression analysis and estimating and/or removing batch effects, use the read counts instead: normalise them with DESeq2/limma/edgeR, make the PCA, and look for the effect. If there is one, correct for it on the read counts, normalise again, and replot the PCA. This kind of workflow is well documented for limma (removeBatchEffect) and for the sva package (ComBat, svaseq). I hope this makes sense now.
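To make the "check, correct, replot" loop concrete, here is a simplified base R sketch on simulated log-scale values. It is not the limma or sva implementation: removeBatchMeans and batchR2 are illustrative helpers written for this example, and the correction simply centres each gene within each batch, which ignores any experimental design. For real analyses the packages named above are the better choice.

```r
set.seed(2)
n.genes <- 300

# Assumed known batch labels, one per sample
batch <- factor(rep(c("A", "B"), each = 3))

# Simulated log2 expression (genes x samples) with a per-gene shift in batch B
logexpr <- matrix(rnorm(n.genes * 6, mean = 8, sd = 0.5), nrow = n.genes,
                  dimnames = list(paste0("gene", 1:n.genes), paste0("S", 1:6)))
shift <- rnorm(n.genes, mean = 0, sd = 2)
logexpr[, batch == "B"] <- logexpr[, batch == "B"] + shift

# Illustrative helper: share of PC1 variance explained by batch (R^2)
batchR2 <- function(x, batch) {
  pc1 <- prcomp(t(x), center = TRUE)$x[, 1]
  summary(lm(pc1 ~ batch))$r.squared
}

# Illustrative helper: subtract each batch's per-gene mean,
# then add the overall per-gene mean back
removeBatchMeans <- function(x, batch) {
  out <- x
  for (b in levels(batch)) {
    idx <- batch == b
    out[, idx] <- x[, idx] - rowMeans(x[, idx, drop = FALSE])
  }
  out + rowMeans(x)
}

before <- batchR2(logexpr, batch)
corrected <- removeBatchMeans(logexpr, batch)
after <- batchR2(corrected, batch)
c(before = before, after = after)
```

Comparing the two values (or the two PCA plots) shows whether the batch signal on PC1 has gone after the correction.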