Question about PCA plot using RPKM/FPKM.
3.0 years ago
hxlei613 ▴ 90

Hi, after searching for how to draw a PCA plot using FPKM, one question still confuses me. For example, I have an FPKM matrix (let's say matrix_sample) with sample1, sample2 ... sampleN, control1, control2 ... controlN as columns and gene1, gene2 ... geneN as rows. I want to check whether the data have a batch effect. So ideally I want to see the points representing samples and the points representing controls separated into 2 groups in the plot.

I note that there are 2 methods to draw PCA plots.

a) # note that in this method, the rows and columns of matrix_sample are geneN and sampleN (or controlN).

   pca = prcomp(matrix_sample)
   plot(pca$rotation[, 1], pca$rotation[, 2], xlab = "PC1", ylab = "PC2")


b) # note that in this method, matrix_sample is transposed.

   library(ggfortify)  # provides autoplot() for prcomp objects
   pca = prcomp(t(matrix_sample))
   autoplot(pca, label = TRUE)


I don't know which method is correct, since a) does not transpose the matrix and b) does. I know that rows are usually observations and columns are variables. But in biology there are fewer samples than genes, so genes are usually placed in rows and samples in columns, which makes the matrix easier to read. However, the two methods give me different plots, and I don't know why. I couldn't find any information on this, or perhaps I am missing something. Please help me out. Thank you very much!

PCA RNA-Seq FPKM • 1.7k views
3.0 years ago

Edit May 18, 2020:

Keep in mind that, when I answer a question, I am usually thinking from a clinical perspective, where the aim is to reduce or control for absolutely every source of bias in the data. While my original answer below is critical of FPKM, the use of FPKM with prcomp() is not entirely invalid in a research context.

If zFPKM cannot be used for whatever reason, I would recommend activating scale = TRUE:

prcomp(x, center = TRUE, scale = TRUE)
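
As a minimal sketch of the scale = TRUE call, assuming x holds samples in rows and genes in columns (the simulated matrix and the names fpkm and keep are hypothetical, not from the original post): prcomp() stops with an error when asked to scale a zero-variance column, which is common for unexpressed genes in FPKM tables, so such genes are dropped first.

```r
set.seed(1)

# Hypothetical FPKM-like matrix: 100 genes (rows) x 6 samples (columns)
fpkm <- matrix(rexp(600, rate = 0.1), nrow = 100,
               dimnames = list(paste0("gene", 1:100),
                               c(paste0("sample", 1:3), paste0("control", 1:3))))
fpkm["gene1", ] <- 0  # an unexpressed gene with zero variance

# Assumption: samples as rows, genes as columns
x <- t(fpkm)

# Drop zero-variance genes; scaling them to unit variance is impossible
keep <- apply(x, 2, var) > 0
x <- x[, keep]

# 'scale' partially matches prcomp()'s formal argument 'scale.'
pca <- prcomp(x, center = TRUE, scale = TRUE)

# Standard deviation and proportion of variance for the first three PCs
summary(pca)$importance[, 1:3]
```

With scale = TRUE every gene contributes unit variance, so a handful of very highly expressed genes no longer dominates the first components.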


If zFPKM is used:

prcomp(zFPKM::zFPKM(x), center = TRUE, scale = FALSE)


## ---------------------

I would not use FPKM units for PCA, nor would I use these units for any analysis where comparing samples is the intention. FPKM units are produced by a normalisation process that renders samples incomparable, because the method includes no cross-sample normalisation; some also question the within-sample normalisation that produces FPKM. If you must use FPKM, at least convert these units to the Z-scale via the zFPKM package in R first, i.e., before running the PCA transformation.

It is perfectly fine to perform PCA on the transposed and un-transposed data matrix. However, in each case, the x variable returned by prcomp() will naturally relate to different things, one being samples and the other your genes.

x

if retx is true the value of the rotated data (the centred (and scaled if requested) data multiplied by the rotation matrix) is returned. Hence, cov(x) is the diagonal matrix diag(sdev^2). For the formula method, napredict() is applied to handle the treatment of values omitted by the na.action. [from: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html]
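
To make the difference between the two calls in the question concrete, here is a minimal sketch on simulated data (the matrix contents and the names pca_a, pca_b and group are illustrative assumptions). In a), genes are the observations, so prcomp() centres each sample column and pca$rotation holds one row of loadings per sample. In b), samples are the observations, so each gene is centred and pca$x holds one row of scores per sample. The two plots therefore show different quantities and need not match.

```r
set.seed(42)

# Hypothetical FPKM-like matrix: 200 genes (rows) x 6 samples (columns)
matrix_sample <- matrix(rexp(1200, rate = 0.1), nrow = 200,
                        dimnames = list(paste0("gene", 1:200),
                                        c(paste0("sample", 1:3),
                                          paste0("control", 1:3))))

# a) untransposed: genes are the observations, samples are the variables
pca_a <- prcomp(matrix_sample)
dim(pca_a$x)         # 200 x 6: one row of scores per gene
dim(pca_a$rotation)  # 6 x 6: one row of loadings per sample

# b) transposed: samples are the observations, genes are the variables
pca_b <- prcomp(t(matrix_sample))
dim(pca_b$x)         # 6 x 6: one row of scores per sample
dim(pca_b$rotation)  # 200 x 6: one row of loadings per gene

# Sample-level plot from b), coloured by group (sample vs control)
group <- factor(sub("[0-9]+$", "", colnames(matrix_sample)))
plot(pca_b$x[, 1], pca_b$x[, 2], col = as.integer(group), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(group),
       col = seq_along(levels(group)), pch = 19)
```

For checking whether samples and controls separate, the scores in pca_b$x are the usual choice, since each point is then one sample projected onto the principal components.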

Kevin