1
0
Entering edit mode
17 months ago
camillab. ▴ 50

Hi,

I am computing PCA with prcomp and I wanted to extract the loadings/rotations for PC1. My dataset (the dataset has been converted into Z scores):

# A tibble: 5 x 38
gene  P4_MU_Sepithelia P16_MU_Sepithel~
<chr>            <dbl>            <dbl>
4 APH1A           0.866            1.04
5 ATOH1           0.162           -0.545


my script for the PCA:

SE_notch3 <- SE_notch  %>% drop_na() #remove rows with NA
rnames <- SE_notch3$gene #select character variable (aka gene names) SE_notch3 <- SE_notch3[-c(1)] # remove gene symbol SE_notch.matrix <-(as.matrix(SE_notch3)) #convert into matrix rownames(SE_notch.matrix) <- rnames # assign row names all <- t(SE_notch.matrix) #transpose #pca res.pca2 <- prcomp(all, center = T, scale=F)  then I tried two different approaches to get the loadings (which in prcomp resides into the rotation parameter): method1: loading.scores <- res.pca2$rotation[,1] #PC1
tes <- sort(genes.scores, decreasing=T) #decreasing orders
top <- tes[1:10] #show the first 10 genes
top


the 10 top gene responsible to drive my PC1 according to the previous code are:

  HES7       HES1       HEY2       HES6      GATA3
0.99819716 0.02058775 0.01975547 0.01965922 0.01623756
0.01535992 0.01517099 0.01410263 0.01406065 0.01274545


but if I compute the PCA on those 10 genes, instead of seeing the sample to cluster far apart they cluster ALL together so I thought I wort the wrong code and I tried this method:

 library(broom)
library(tidyr)
tidy(matrix = "variables") %>%
filter(PC %in% c(1)) %>%   # retain only PC1
group_by(PC) %>%
arrange(desc(abs(value))) %>%  # sort descending value
slice(1:10) %>% # take top 10
pull(column) %>%  # extract the column (gene name) from the table
unique() # retain unique gene names only

top_genes


and the top genes are:

[1] "PSEN1"  "JAG2"   "CREBBP" "DTX2"   "MAML1"
[6] "LNX1"   "NCSTN"  "ADAM12" "YY1"    "ITCH"


What is the right method? Why do I get different output if the PCA is computed in the same way? Where is the error?

I have also tried to compute the PCA with FactoMineR package that doesnt show me the loadings directly but it gives me the plot of the contribution when I set which Pc I want:

library("FactoMineR") # for computing PCA
library("factoextra") # for graph
#PCA
res.pca <- PCA(all, graph = FALSE, scale.unit = FALSE)
#contribution  variables
top10<- fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
top10


and the genes with most contributions to PC1 are those that I found with the "first" code :

HES7       HES1       HEY2       HES6      GATA3


Camilla

0
Entering edit mode
17 months ago

If all samples cluster together after you performed PC using just those 10 genes, then it tells me that they (and PC1) do not contribute too much to the overall explained variation in your dataset as you may think.

PC1 is just the PC that contributes most variation to your dataset - it may only contribute 15%, though. Also, there may not actually be any genes along PC1 that are contributing strongly to that 15% explained variation.

Ultimately, I am not sure why you want to perform PCA on a tiny subset of just 10 genes.

Kevin

0
Entering edit mode

thank you for your answer. the PC1 on the dataset contributes to the 90.9% of the variation in my dataset. I tried to use your package since it gives very nice plots but I dont understand how to get the plot of the loadings only for the PC1 and when I run

p <- pca(SE_notch.matrix, scale=F) #dont need to transpose


I get different variables retained (compare to the previous one) which I am assuming is due to the fact that it consider variables responsible for PC1 and PC2:

 -- variables retained:
HES7, ADAM10, ADAM12, ADAM17, APH1A, ATOH1, CBL, CNTN1, CNTN6, CREBBP, CTBP1, CTBP2, DLL4, DNER, DTX2, DTX3L, DTX4, DVL1, DVL3, ENO1, EPS15, FBXW7, FURIN, FUT8, GATA3, HDAC1, HDAC2, HES1, HES6, HEY1, HEY2, HEYL, ITCH, JAG1, JAG2, KAT2A, LFNG, LNX1, LNX2, MAML1, MAML3, MDK, MIB1, MYC, MYCBP, NCSTN, NLE1, NOTCH1, NOTCH2, NUMB, PSEN1, PSEN2, PTPRZ1, RBPJ, RFNG, SKP2, UPF2, YY1


also, I wanted to re-run the PCA only on those 10variables just to confirm that I was writing right the code and I was identifying correctly the genes responsible for PC1, (e.g if X number genes are the most responsible to drive the PC1 then I am assuming that re-running the PCA on them will give me a plot with all my samples far away from each other, am I wrong?)

1
Entering edit mode

I see, then you'd probably have to include a lot more than just 10 genes if you want to re-create the same structure. While PC1 is ~90%, those 10 genes that you chose could only represent a small portion of that. I cannot see your data or the bi-plots, so, I a not to know.

My plotloadings function, for example, may be a better way to do this, as, by default, it takes the genes that fall into the top 5% of the loadings range. If you choose PC1, and PC2, it will do this separately for PC1 and PC2, and then bind the results for each into a unique vector (so, top 5% PC1 + top 5% PC2`).