I want to do a differential expression analysis for a data set with six conditions. I started with the EDA, but after plotting my PCA I see two samples, which can be considered outliers.
I know that s4 might be one because in the sequencing we got "only" 2M reads while the others have a lot more, bit how can I figure out why sample s2 is showing such a behaviour?
The fastqc run didn't show any difference, duplication is normal GC is good. Is there a way to understand better, how the counts differ from the other samples?
my code and plot are shown below?
thanks
sampleDists <- dist(t(assay(vsd.cts)))
sampleDistMatrix <- as.matrix( sampleDists )
colnames(sampleDistMatrix) <- colData(vsd.cts)$sample
rownames(sampleDistMatrix) <- colData(vsd.cts)$sample
... # setting colors
pcaData <- plotPCA(vsd.cts, intgroup = c("sample", "condition", "treatment"), returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
# When the warning Consider increasing max.overlaps or similar comes
# options("ggrepel.max.overlaps", default = Inf) or
options(ggrepel.max.overlaps = Inf)
ggplot(pcaData, aes(x = PC1, y = PC2, color = treatment, shape = condition )) +
geom_point() +
scale_color_manual(values = c25[1:nr_treatment]) +
geom_text_repel(aes(PC1, PC2, label = pcaData$name),size = 2) +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
# coord_fixed() +
# theme(plot.margin=grid::unit(c(0,0,0,0), "mm")) +
ggtitle("PCA with vst-transformed data, colored by conditions") +
labs(shape = "Condition", color = "Treatment" )
https://www.bioconductor.org/packages/devel/bioc/vignettes/DEGreport/inst/doc/DEGreport.html
I'm not sure, but this tool may be helpful for you. It will tell you correlations between PC components and covariates, lib.size and etc.
In case it becomes important to preserve S4, then resequence that library to get more depth. Should be plenty of material from library prep left in a typical sequencing run. As for S2, can be pretty much anything. Explore whether any known covariate other than treatment and condition might peak there (maybe RIN) and if it doesnt then either remove of downweight. limma-arrayWeights can easily downweight outliers rather than removing it.