Extraction of high-variance or highly contributing genes using PCAtools
2.3 years ago

Hi,

I am working with qRT-PCR log2FC data (96 genes in rows × samples in columns) from healthy controls and patients treated with different stimulations. I am passing this log2FC data.frame, along with the sample metadata, to PCAtools to plot a PCA biplot.

p <- pca(log2FC.df, metadata = Sample_metadata, center = TRUE, scale = FALSE, removeVar = 0.1)
-- removing the lower 10% of variables based on variance

I believe I can extract the genes from p$loadings in PCAtools, which is similar to p$rotation in prcomp and gives the contribution of each gene to each PC. There are 96 PCs altogether, and 80 genes in the rows. I am only interested in extracting the genes for PC1 and PC2 (the PCs explaining the largest % of variance), but all the remaining PCs (3, ..., 96) also show the same genes, which confuses me. Should the PC1 and PC2 loadings be sorted before extracting the genes? Additionally, I would like to re-plot the PCA using these PC1 and PC2 loadings. Does that make sense, or should I subset the original log2FC data.frame to these 80 genes and then re-run and re-plot the PCA?

p$loadings[,c("PC1", "PC2")]

dim(p$loadings[,c("PC1", "PC2")])

PC1.2 <- as.data.frame(p$loadings[,c("PC1", "PC2")])

Thank you,

Toufiq

PCAtools FactoMineR R prcomp PCA • 1.5k views

I am the main developer of PCAtools. What is it that you would like to do? The variable / component loadings are unitless values that represent the strength of the contribution of each gene / protein / variable to each PC.
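To illustrate why every PC lists the same genes, here is a minimal sketch using base R's prcomp on simulated data. The matrix, its dimensions, and the gene and sample names are made up for illustration; with PCAtools the analogous table would be p$loadings. Every gene gets a loading on every PC, so what differs between PCs is the size of the values, not the set of row names.

```r
# Simulated stand-in for a genes x samples log2FC matrix (hypothetical data).
set.seed(1)
mat <- matrix(rnorm(20 * 12), nrow = 20,
              dimnames = list(paste0("gene", 1:20), paste0("sample", 1:12)))

# Run PCA with samples as observations, so that genes are the variables.
fit <- prcomp(t(mat), center = TRUE, scale. = FALSE)

loadings <- fit$rotation        # genes in rows, PCs in columns
dim(loadings)                   # one row per gene, one column per PC
head(rownames(loadings))        # the same gene names apply to every PC

# Each PC is a unit-length vector of gene weights (sum of squares equals 1).
colSums(loadings^2)

# Rank the genes by absolute loading on PC1 only.
pc1_ranked <- sort(abs(loadings[, "PC1"]), decreasing = TRUE)
head(pc1_ranked, 5)
```

Sorting on the absolute value treats strongly negative and strongly positive loadings alike, which is usually what is meant by "contribution" to a PC.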


Kevin Blighe, thank you for the prompt reply. I would like to extract the highly contributing genes from PC1 and PC2 and re-plot the 2D PCA or scatter plot.


I would first identify the top 10, 20, or 50 genes based on the absolute values of the component loadings, then filter your input data for these genes, and then re-perform PCA. I am not sure that this procedure is standard, though. What are you hoping to achieve?
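The steps above can be sketched as follows. This is only an illustration on simulated data with base R's prcomp; the matrix, the helper top_genes(), and the cutoff of 20 are assumptions chosen for the example, not part of PCAtools. With PCAtools you would rank p$loadings in the same way (taking gene names from rownames(p$loadings)) and pass the filtered data back to pca().

```r
# Simulated stand-in for a 96-gene x 24-sample log2FC matrix (hypothetical data).
set.seed(42)
n_genes <- 96
n_samples <- 24
log2fc <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("sample", seq_len(n_samples))))

# First PCA on all genes (samples as observations, genes as variables).
fit <- prcomp(t(log2fc), center = TRUE, scale. = FALSE)

# Hypothetical helper: names of the n genes with the largest absolute loading on one PC.
top_genes <- function(loadings, pc, n) {
  ranked <- sort(abs(loadings[, pc]), decreasing = TRUE)
  names(ranked)[seq_len(n)]
}

top_n <- 20  # arbitrary cutoff for illustration
selected <- union(top_genes(fit$rotation, "PC1", top_n),
                  top_genes(fit$rotation, "PC2", top_n))

# Filter the input data to the selected genes and re-perform PCA.
subset_mat <- log2fc[selected, , drop = FALSE]
refit <- prcomp(t(subset_mat), center = TRUE, scale. = FALSE)

# Proportion of variance explained by the first two PCs after filtering.
(refit$sdev^2 / sum(refit$sdev^2))[1:2]
```

Because the genes are selected for their loadings on PC1 and PC2, the re-run PCA will tend to show those two components explaining a larger share of the variance, so the sharper separation should be interpreted with that selection in mind.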


Agreed. For this you would just need to sort by the loading for a given PC and then take the top few. But for what purpose? There might be a better suggestion depending on the goal.


Kevin Blighe and Vincent Laufer

Thank you. I am interested in extracting the high-variance genes and plotting the data. My log2FC data matrix contains a total of 96 genes, and with all of them the stimulation conditions were scattered in the plot. I thought of extracting the top genes contributing to PC1 and PC2 and then re-plotting the data with only these genes.

Extract.Features.PCA

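# Rank genes by their largest absolute loading on PC1 or PC2 and keep the top 50.
# The rows of p$loadings are not sorted, so p$loadings[1:50, ] would just be the first 50 genes.
max.abs.loading <- pmax(abs(p$loadings[, "PC1"]), abs(p$loadings[, "PC2"]))
top.idx <- order(max.abs.loading, decreasing = TRUE)[1:50]
Extract.Features.PCA <- data.frame(Gene_Symbols = rownames(p$loadings)[top.idx])
rownames(Extract.Features.PCA) <- Extract.Features.PCA$Gene_Symbols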

Plot heatmap

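# Subset the original log2FC data to the selected genes
PC1.PC2 <- log2FC.df[rownames(Extract.Features.PCA), ]
library(ComplexHeatmap)
# Heatmap() works on a matrix, so convert the data.frame explicitly
Heatmap(as.matrix(PC1.PC2))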

Plot PCA

p_PC1.PC2 <- pca(PC1.PC2, metadata = Sample_metadata, center = TRUE, scale = FALSE, removeVar = 0.1)

biplot(p_PC1.PC2,
       x = 'PC1', y = 'PC2',
       lab = NULL,
       colby = 'Stim', colkey = c("Stim 1" = "#4FF300", "Stim 2" = "#FFEE07",  "Stim 3" = "#000000"),
       legendPosition = 'right', legendLabSize = 13, legendIconSize = 3.0,
       shape = 'Subject', shapekey = c('A' = 8, "B" = 15, "C" = 17, "D" = 18),
       subtitle = 'PC1 vs. PC2')