In one PCA plot, can I calculate the percentage of different factors that contribute to the PCA?
2
0
Entering edit mode
16 days ago

I have a question that how to know how much contribution that each factor makes to the total PCA results, for example, use percentage to explain the contribution?

I have a gene expression matrix from six groups (a,b,c,d,e,f), in each group there are at least two replicates. The six groups are from different donors (For example: replicates in group a from donor A and donor B).

Here is my example data and related sample information:

    structure(list(a_1 = c(0.777826793724671, 0.723980296636, 0.242330621229485, 
0.74905050941743, 0.293395987479016, 0.508663943270221, 0.568264301633462, 
0.71544666425325, 0.22279090131633, 0.162438383558765), a_2 = c(0.840473337331787, 
0.750769911101088, 0.824924289714545, 0.451419109478593, 0.67439135722816, 
0.968571343459189, 0.657587153837085, 0.449371922994033, 0.460420181043446, 
0.506347043439746), b_1 = c(0.606637348420918, 0.0831037482712418, 
0.948381775757298, 0.938334624515846, 0.00419102935120463, 0.188286359421909, 
0.208198433509097, 0.0177543577738106, 0.718145203776658, 0.0831736491527408
), b_2 = c(0.321765134111047, 0.211961556924507, 0.587257345672697, 
0.144032216630876, 0.972890468547121, 0.855327832512558, 0.57097671367228, 
0.577104192459956, 0.0781774870119989, 0.114624827634543), c_1 = c(0.391017276560888, 
0.590633708517998, 0.795098746661097, 0.728984325425699, 0.493055917322636, 
0.369274253724143, 0.773150491062552, 0.975961270509288, 0.813110067509115, 
0.0556579001713544), c_2 = c(0.667865458643064, 0.632330810651183, 
0.694971590070054, 0.530965579906479, 0.965042072581127, 0.593090373789892, 
0.572259798645973, 0.175909371348098, 0.888658297248185, 0.813629044219851
), d_1 = c(0.241499250289053, 0.00761426473036408, 0.461811906658113, 
0.109057808294892, 0.866662635235116, 0.454646666068584, 0.744061368750408, 
0.863238325342536, 0.640526893315837, 0.868945126188919), d_2 = c(0.319046971853822, 
0.91230500722304, 0.501486229710281, 0.364735695533454, 0.575103351147845, 
0.466834801947698, 0.793754573678598, 0.851761769503355, 0.967630770988762, 
0.154894143110141), e_1 = c(0.816843639593571, 0.803759014001116, 
0.960935505107045, 0.574637013487518, 0.173312981147319, 0.0567971160635352, 
0.35941729741171, 0.427865072153509, 0.325001205084845, 0.553443755256012
), e_2 = c(0.951808236306533, 0.734539293451235, 0.637795145623386, 
0.0906503377482295, 0.307018371066079, 0.837945698061958, 0.575052802218124, 
0.149990632198751, 0.740633937064558, 0.614213866414502), e_3 = c(0.655502210138366, 
0.303118638927117, 0.754946870729327, 0.973303767153993, 0.15387549251318, 
0.727580216247588, 0.133633797988296, 0.990649084327742, 0.508409713627771, 
0.291543716564775), f_1 = c(0.722936338512227, 0.43016006750986, 
0.668916463851929, 0.597232702188194, 0.566613202914596, 0.492413811851293, 
0.841789987403899, 0.991420056903735, 0.654314963845536, 0.361741523491219
), f_2 = c(0.445054198848084, 0.561434917617589, 0.00869911885820329, 
0.193016097648069, 0.625879230443388, 0.440140291117132, 0.192036809166893, 
0.825253788847476, 0.706002304796129, 0.560118380701169), f_3 = c(0.481232576072216, 
0.388952540932223, 0.547279508085921, 0.0684368198271841, 0.211692467099056, 
0.0816763381008059, 0.179641566239297, 0.111117320600897, 0.471861350117251, 
0.0512471403926611), f_4 = c(0.618208972737193, 0.993882849579677, 
0.0283353771083057, 0.699978570453823, 0.377288895659149, 0.486861442681402, 
0.55804020841606, 0.609155245823786, 0.911775157321244, 0.616870308062062
), f_5 = c(0.662106775445864, 0.825681922957301, 0.838785516098142, 
0.812102147843689, 0.16824291436933, 0.873179373564199, 0.230873316759244, 
0.852764319395646, 0.354741029441357, 0.673453942872584), f_6 = c(0.696460561361164, 
0.63435076456517, 0.488687434000894, 0.974564600270241, 0.0511445237789303, 
0.850382218603045, 0.52776403632015, 0.878997334977612, 0.265999732539058, 
0.199336896184832), f_7 = c(0.155626990366727, 0.609529266366735, 
0.405262058833614, 0.832081991713494, 0.759743764763698, 0.116353970719501, 
0.945955528412014, 0.744843143504113, 0.824364016996697, 0.0772311592008919
)), class = "data.frame", row.names = c(NA, -10L))

structure(list(Group = c("a", "a", "b", "b", "c", "c", "d", "d", 
"e", "e", "e", "f", "f", "f", "f", "f", "f", "f"), Rep = c("a_1", 
"a_2", "b_1", "b_2", "c_1", "c_2", "d_1", "d_2", "e_1", "e_2", 
"e_3", "f_1", "f_2", "f_3", "f_4", "f_5", "f_6", "f_7"), Donor = c("A", 
"B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "A", "B", "I", 
"J", "C", "L", "J")), class = "data.frame", row.names = c(NA, 
-18L))

Based on this data, I produced a PCA plot.

Then I want to know if I can calculate the donor effect (A~L) and group effect(a~f) that contribute the PCA result, especially dim PC1 and dim PC2?

So I can say, 30% of the PC1 variance is contributed by donor effect, the rest PC1 70% variance can be explained by group effect?

Because in my data,I only want to focus on group effect that can distinct different groups by gene expression. But the donor effect also exist.

If there are any better methods to be used to do similar calculation?

I hope you can give me some adivce on this question.

Thanks in advance.

PCA R VARIANCE • 693 views
ADD COMMENT
2
Entering edit mode
15 days ago

Here is my script that can be used to calculate how much degree that different factors contribute to PCA variance (PC1 and PC2).

# Install and load necessary packages
install.packages("lme4")
install.packages("ggplot2")
install.packages("reshape2")
library(lme4)
library(ggplot2)
library(reshape2)

# Assuming you have your data in 'data' and 'Group_Infor' as earlier
data <- data.frame(
  matrix(runif(180, 0, 1), 10, 18)  # Example data, replace with actual data
)
colnames(data) <- paste0(rep(c("a", "b", "c", "d", "e", "f"), c(2, 2, 2, 2, 3, 7)), "_", c(1:2, 1:2, 1:2, 1:2, 1:3, 1:7))

Group_Infor <- data.frame(
  Group = rep(c("a", "b", "c", "d", "e", "f"), c(2, 2, 2, 2, 3, 7)),
  Rep = colnames(data),
  Donor = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "A", "B", "I", "J", "C", "L", "J")
)

# Perform PCA
pca_result <- prcomp(t(data), scale. = TRUE)

# Extract PC1 and PC2 scores
pc_scores <- data.frame(pca_result$x[, 1:2])
colnames(pc_scores) <- c("PC1", "PC2")
pc_scores <- cbind(pc_scores, Group_Infor)

# Fit mixed models for PC1
model_pc1 <- lmer(PC1 ~ 1 + (1 | Group) + (1 | Donor), data = pc_scores)
summary(model_pc1)

# Fit mixed models for PC2
model_pc2 <- lmer(PC2 ~ 1 + (1 | Group) + (1 | Donor), data = pc_scores)
summary(model_pc2)

# Extract variance components for PC1
var_components_pc1 <- as.data.frame(VarCorr(model_pc1))
var_components_pc1 <- var_components_pc1[1:3, "vcov"]
names(var_components_pc1) <- c("Group", "Donor", "Residual")
var_components_pc1 <- var_components_pc1 / sum(var_components_pc1)

# Extract variance components for PC2
var_components_pc2 <- as.data.frame(VarCorr(model_pc2))
var_components_pc2 <- var_components_pc2[1:3, "vcov"]
names(var_components_pc2) <- c("Group", "Donor", "Residual")
var_components_pc2 <- var_components_pc2 / sum(var_components_pc2)

# Combine and print the results
variance_explained <- data.frame(
  Component = c("Group", "Donor", "Residual"),
  PC1 = var_components_pc1,
  PC2 = var_components_pc2
)
print(variance_explained)

# Reshape data for ggplot2
variance_long <- melt(variance_explained, id.vars = "Component", variable.name = "PC", value.name = "Proportion_of_Variance")

# Create the stacked bar plot
ggplot(variance_long, aes(x = PC, y = Proportion_of_Variance, fill = Component)) +
  geom_bar(stat = "identity", color = "black") +
  theme_minimal() +
  labs(title = "Proportion of Variance Explained by Group, Donor, and Residuals",
       x = "Principal Component",
       y = "Proportion of Variance") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("Group" = "lightblue", "Donor" = "lightgreen", "Residual" = "lightcoral"))

Also, after calculation, we can make a barplot to see contribution in detail.

It works well.

But I found there was a disadvantage that here the calculation is only based on PC1 and PC2, not based on all the PC dimensions.

I am looking for other methods to see if I can get more useful information for my question.

ADD COMMENT
1
Entering edit mode
16 days ago
marco.barr ▴ 140

Hi, regardless of how you generated the data whether with DESeq2 or other pipeline, your differential expression matrix can be used as input to the factoextra package to get what you want. The components of the get_pca_var() can be used in the plot of variables as follow:

var$coord: coordinates of variables to create a scatter plot

var$cos2: represents the quality of representation for variables on the factor map. It’s calculated as the squared coordinates: var.cos2 = var.coord * var.coord.

var$contrib: contains the contributions (in percentage) of the variables to the principal components. The contribution of a variable (var) to a given principal component is (in percentage) : (var.cos2 * 100) / (total cos2 of the component).

The function fviz_contrib can be used to visualize the contribution of features to principal components. Here is a reproducible example:

library(factoextra)    
data.pca <- prcomp(your_data,scale=T, center=T)
var <- get_pca_var(data.pca)
head(var$contrib)
#contribution of variabile to component choise
fviz_contrib(data.pca, choice= "var",  axes=1, top=10)

Adapt the script to your needs

ADD COMMENT
0
Entering edit mode

Thanks a lot. It is almost what I need but not all.

From your script, I can get information that how much degree different samples contribute to dim 1 or pc1.

But I also want to know how much degree the donor affect the pc1 or pc2 in pca.

There are 18 samples (columns) from 12 donors are divided into 6 groups.

Samples from the same donor but belong to different groups can have similar expression pattern.

We can distinct 18 samples by their groups in PCA. But the difference among the 6 groups is not quite distinct because of the donor effect.

So question is how much degree the group effect and the donor effect have, respectively.

Your script can be used to calculate how much degree each sample contributes to the PCA. It's great. But not so necessary for my core question.

Thanks.

<h6>##### update:</h6>

I ask chat-gpt for this question. And it gives me a suitable method. I am having a try.

I will post message here if I have any new questions.

Thanks.

ADD REPLY
1
Entering edit mode

ok, then you should add the donor information to the PCA analysis. First create a separate dataframe for the donors, then add with factor the donor information into the PCA results. Try chat-gpt if so write here, let me know!

ADD REPLY
0
Entering edit mode

Ok, I will response to you here as soon as possible.

ADD REPLY
0
Entering edit mode

You can have a look on my answer. I have post it below.

ADD REPLY
0
Entering edit mode

great team work! why can't you access and do the calculation for the other components too? can't you extract the information from the other components and add to your dataset?

ADD REPLY
0
Entering edit mode

Yes, I can. But there are only two factors that affect my data PCA distribution: groups and donor. And the rest is regarded as residual.

I think it is suitable for me. However the factors contribution that I got was not assistant with my hypothesis. It is a sad thing.

ADD REPLY
0
Entering edit mode

what a pity...

ADD REPLY

Login before adding your answer.

Traffic: 1223 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6