Question

How to interpret a Scree plot

0

Entering edit mode

3.4 years ago

Lorenzo ▴ 10

Hi to all,

I'm working with RNA-seq data and I 'm trying to figure out if in my data there is a batch effect. I normalized my data in TPM through Salmon and then I run the following code

      > setwd("~/R")
      > countData <- read.csv("CountData.csv",row.names="X")
      > project.pca <- prcomp(t(countData))
      > project.pca.proportionvariances <- ((project.pca$sdev^2) / (sum(project.pca$sdev                                          
         ^2)))*100
      > barplot(project.pca.proportionvariances, cex.names=1, xlab=paste("Principal comp                                              
       onent (PC), 1-", 
       length(project.pca$sdev)), ylab="Proportion of variation (%)", main="Scree plot",                                      
       ylim=c(0,100))

I obtained a Scree plot like this:

Can someone help me to interpret this plot?

Thank you in advance for any suggestion

RNA-Seq • 1.8k views

ADD COMMENT • link updated 3.4 years ago by rpolicastro 13k • written 3.4 years ago by Lorenzo ▴ 10

1

Entering edit mode

This plot won't tell you whether you have batch effects or not, you have to look at the biplot for this to see whether samples cluster unexpectedly, or run tools like sva to check for hidden unwanted variation. To do so you probably want to summarize your salmon transcript abundance estimations to the gene level, see tximport package, and then properly normalize the counts, e.g. via DESeq2::vst or the normalization method in edgeR.

ADD REPLY • link 3.4 years ago by ATpoint 81k

0

Entering edit mode

This might be helpful: https://statquest.org/pca-clearly-explained/

ADD REPLY • link 3.4 years ago by WouterDeCoster 47k

0

Entering edit mode

Please don't delete a question after someone spends their time to help you with it.

ADD REPLY • link 3.4 years ago by WouterDeCoster 47k

score 3 · Answer 1 · 2020-12-03

To oversimplify it a bit, PCA attempts to explain the variance in data with many dimensions (such as tens of thousands of genes), in much fewer dimensions.

For RNA-seq, each principle component (PC) is a collection of genes that explain some percentage of the total variance in your data. Take a hypothetical ideal situation where you have 3 WT samples and 3 KD samples that you did RNA-seq for. Most of the variance in the data should occur in the collection of genes that are up and downregulated between the two conditions. Therefore, a single principle component may be able to explain most of the variance in the data because these genes explain the separation of the two conditions.

Let's now imagine that an overworked scientist made a slight mistake in sample collection for the last two KD samples by growing the samples at a slightly higher temperature than the others, and this mistake propagates into changes in gene expression for those two samples compared to all others. You now have two overlapping transcription profiles in your data, the first being the separation of condition (WT vs KD), and the other being the separation of batch (normal vs slightly elevated temperature). Your first PC may still separate the WT samples from the KD samples, but you may notice that only 75% of the variance in the data is explained this time, instead of the close to 100% you got from the ideal situation. What you do notice now though is that you have a second PC that explains 25% of variance in your data. When you plot this dimension you can clearly see that the two samples grown in the slightly elevated condition are being separated from the other 4 samples grown at the correct temperature. This means that the second PC is capturing the variance in the data introduced by the temperature difference.

What your scree plot is telling you is that you have roughly 4-7 PCs that may be of interest in your data. Without knowing how many samples and conditions you have it would be difficult to give any specific reason for this. You could make a pairs plot from the excellent library PCAtools to explore how the samples are being separated by the different dimensions. Using the same library, you could also look at the loading plots to see the genes that are contributing to each dimension.