Question

How do I interpert PCA before differential expression analysis

1

Entering edit mode

3 months ago

gogeni5529 ▴ 80

I want to do a differential expression analysis for a data set with six conditions. I started with the EDA, but after plotting my PCA I see two samples, which can be considered outliers.

I know that s4 might be one because in the sequencing we got "only" 2M reads while the others have a lot more, bit how can I figure out why sample s2 is showing such a behaviour?

The fastqc run didn't show any difference, duplication is normal GC is good. Is there a way to understand better, how the counts differ from the other samples?

my code and plot are shown below?

thanks

sampleDists <- dist(t(assay(vsd.cts)))
sampleDistMatrix <- as.matrix( sampleDists )
colnames(sampleDistMatrix) <- colData(vsd.cts)$sample
rownames(sampleDistMatrix) <- colData(vsd.cts)$sample
... # setting colors
pcaData <- plotPCA(vsd.cts, intgroup = c("sample", "condition", "treatment"), returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
# When the warning Consider increasing max.overlaps or similar comes
# options("ggrepel.max.overlaps", default = Inf) or 
options(ggrepel.max.overlaps = Inf)

ggplot(pcaData, aes(x = PC1, y = PC2, color = treatment, shape = condition )) +
  geom_point() + 
  scale_color_manual(values = c25[1:nr_treatment]) +
  geom_text_repel(aes(PC1, PC2, label = pcaData$name),size = 2)  +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
#  coord_fixed() +
#  theme(plot.margin=grid::unit(c(0,0,0,0), "mm")) +
  ggtitle("PCA with vst-transformed data, colored by conditions") +
  labs(shape = "Condition", color = "Treatment" )

PCA

EDA DESeq2 PCA QC • 961 views

ADD COMMENT • link updated 3 months ago by Mensur Dlakic ★ 30k • written 3 months ago by gogeni5529 ▴ 80

1

Entering edit mode

https://www.bioconductor.org/packages/devel/bioc/vignettes/DEGreport/inst/doc/DEGreport.html

I'm not sure, but this tool may be helpful for you. It will tell you correlations between PC components and covariates, lib.size and etc.

Covariates effect on count data

Another important analysis to do if you have covariates is to calculate the correlation between PCs from PCA analysis to different variables you may think are affecting the gene expression. This is a toy example of how the function works with raw data, where clearly library size correlates with some of the PCs.

ADD REPLY • link 3 months ago by jkim ▴ 230

0

Entering edit mode

In case it becomes important to preserve S4, then resequence that library to get more depth. Should be plenty of material from library prep left in a typical sequencing run. As for S2, can be pretty much anything. Explore whether any known covariate other than treatment and condition might peak there (maybe RIN) and if it doesnt then either remove of downweight. limma-arrayWeights can easily downweight outliers rather than removing it.

ADD REPLY • link 3 months ago by ATpoint 90k

score 3 · Answer 1 · 2025-07-23

I know that s4 might be one because in the sequencing we got "only" 2M reads while the others have a lot more, bit how can I figure out why sample s2 is showing such a behaviour?

There are a load of different ways to inspect these sample with relation to the others.

I usually start by looking at a presence/absence PCA to see if you at least captured the same transcripts as the other replicates.
You can also inspect the samples for contamination using something like Kraken2.
If all still looks fine, I would proceed with the downstream analyses, and see where the main differences are coming from. Is there enrichment in a biological function that could make sense in your experimental design?

If S2 is a technical replicate, then it's clear something went wrong between sampling and sequencing, so could be due to poor extraction, amplification, library prep, etc... But if this is a biological replicate, it could be real and your condition has less effect than you think, meaning something you didn't control is the cause of the variation. S4 simply looks like something just went wrong processing the sample IMO.

score 2 · Answer 2 · 2025-07-23

2

Entering edit mode

3 months ago

Mensur Dlakic ★ 30k

To my eye there are some problems beyond what you noticed.

First, It is almost a guarantee that s4 must be removed. PC axes usually have values in single or low double digits, and that sample alone stretches the X-axes to >150. Most of the variance lies in how that sample is different from others. Similar logic applies to s2, albeit to a smaller degree.

Second, there seem to be problems even in your clustered samples. s12 is a KO but appears more similar to WTs, and that trend is also seen with s13. Red samples are mixed as well. Some of these secondary problems may be reduced when you remove s2.

ADD COMMENT • link 3 months ago by Mensur Dlakic ★ 30k

0

Entering edit mode

Since PC1 and PC2 are not corresponding to anything meaningful in most of the samples, I wouldn't draw any conclusions about the other samples' coordinates on those axes.

Omit the two outliers, and the rest of the samples will cluster very differently.

ADD REPLY • link 3 months ago by swbarnes2 15k

0

Entering edit mode

I already noted: Some of these secondary problems may be reduced when you remove s2.

ADD REPLY • link 3 months ago by Mensur Dlakic ★ 30k