Finally answered 3.5 years later now that I have time....
Hey,
I appreciate you reaching out directly - happy to try to help with this. I'll address your two main questions based on my experience with RNA-seq analyses in edgeR (and DESeq2, for that matter). I'll focus on practical suggestions, including some R code where it makes sense, as that's usually the easiest way to implement these ideas.
Quantifying Divergence / Separation Between Treatment and Control Groups in Each Breed
You're right that MDS (multidimensional scaling) plots are great for visualizing separation, but they're not ideal for numerical quantification, especially when you want a single metric to compare across breeds. The edgeR plotMDS() function uses a form of classical MDS based on leading log-fold changes or distances, which is why it's handy for quick checks, but for more robust quantification, I strongly recommend switching to principal component analysis (PCA). PCA is essentially a cousin of MDS but gives you more interpretable outputs, like eigenvalues (variance explained per component) and loadings (which genes drive the separation). I've developed the PCAtools package specifically for this kind of in-depth exploration on high-dimensional data like RNA-seq counts.
Here's why PCA is better here:
- It directly models variance in your data, so you can quantify how much of the total variation is attributable to treatment vs. other factors (e.g., sex, which I see in your design table).
- You can compute group-level metrics like the distance between centroids (means) of the treatment and control groups in PCA space.
- It handles the high dimensionality of your ~18k genes well, especially after filtering low-expressed genes.
Step-by-Step Approach
Prepare Your Data: Start with your TMM-normalized CPM values (as you mentioned). Make sure to log-transform them (e.g., logCPM) to approximate a Gaussian distribution, which is better for PCA. Filter out lowly expressed genes if you haven't already (e.g., keep genes with CPM > 1 in at least 3 samples).
Assuming you have your edgeR DGEList object y after normalization:
library(edgeR)
logcpm <- cpm(y, log = TRUE, normalized.lib.sizes = TRUE)
# Subset to your ~18k DE genes if that's what you want to focus on
logcpm_de <- logcpm[de_genes, ] # de_genes is a vector of your 18k gene IDs
Run PCA Per Breed: Since you want to compare within each breed, subset your data by breed and run PCA separately for each. This isolates the treatment effect without confounding from breed differences.
# Assume 'samples' is your data.frame with columns: Individual, Status, Breed, Sex
breeds <- unique(samples$Breed) # e.g., c("Breed 1", "Breed 2", "Breed 3", "Breed 4")
divergence_metrics <- list()
for (b in breeds) {
# Subset samples and logCPM for this breed
breed_idx <- samples$Breed == b
breed_logcpm <- logcpm_de[, breed_idx]
breed_samples <- samples[breed_idx, ]
# Run PCA (center but don't scale if already logged; scaling can be added if needed)
pca <- prcomp(t(breed_logcpm), center = TRUE, scale. = FALSE)
# Get scores (sample coordinates in PCA space)
scores <- as.data.frame(pca$x)
# Add group labels
scores$Status <- breed_samples$Status
# Compute centroid (mean) for each group along the first few PCs (e.g., PC1-PC5, where most variance is)
centroid_treated <- colMeans(scores[scores$Status == "Treated", 1:5])
centroid_control <- colMeans(scores[scores$Status == "Control", 1:5])
# Euclidean distance between centroids as a divergence metric
dist_centroids <- dist(rbind(centroid_treated, centroid_control))
# Alternatively, Mahalanobis distance (accounts for covariance structure)
library(stats)
mahal_dist <- mahalanobis(centroid_treated, centroid_control, cov(scores[, 1:5]))
# Or, variance explained by treatment: use ANOVA on PC1 scores
anova_pc1 <- anova(lm(PC1 ~ Status, data = scores))$`Pr(>F)`[1] # p-value for separation on PC1
var_explained <- summary(lm(PC1 ~ Status, data = scores))$r.squared # R^2 as proportion of variance due to treatment
# Store results
divergence_metrics[[b]] <- list(
euclidean_dist = dist_centroids,
mahalanobis_dist = mahal_dist,
pc1_pval = anova_pc1,
treatment_var_explained = var_explained
)
}
# Compare across breeds
divergence_metrics
- Euclidean Distance Between Centroids: This gives a simple, overall dissimilarity score between the two groups. It's better than raw pairwise distances because it summarizes at the group level. Higher distance = greater separation = stronger treatment effect.
- Mahalanobis Distance: This is more sophisticated as it accounts for the data's covariance (i.e., correlations between genes/PCs), making it robust to anisotropic distributions.
- Variance Explained by Treatment: Using R² from a linear model on PC1 (or other PCs) tells you what fraction of the main axis of variation is due to treatment. If PC1 separates treatment vs. control well, a high R² indicates strong effect.
- You could also use PERMANOVA (from the
vegan package) on the full distance matrix for a statistical test of group differences: adonis(dist(t(breed_logcpm)) ~ Status, data = breed_samples).
Avoid correlation coefficients for this—they're more for pairwise sample similarity and can indeed give odd results in high dimensions without transformation.
If you use my PCAtools package, you can get even more: biplots with loadings, scree plots, and horn's parallel analysis to decide how many PCs to keep. Install via Bioconductor: BiocManager::install("PCAtools").
Interpreting Across Breeds: The breed with the highest distance or R² has the greatest treatment effect. From your design table, breeds look balanced (~5-6 samples/group), which is good—avoids bias from unequal sizes. If sex is a confounder, add it to the model (e.g., ~ Status + Sex).
This should give you a numerical way to rank the breeds by treatment effect divergence.
Relation Between Common Dispersion, Number of Significant DE Genes, and Group Separation
Yes, there can be relationships here, but they're not always straightforward— it depends on the data. Let's break it down:
Common Dispersion in edgeR: This is the estimated average biological variability (as the coefficient of variation squared) across genes after fitting the model. It's a measure of within-group noise/variation. Lower common dispersion means tighter replicates within groups, making it easier to detect DE (since signal-to-noise is higher).
Number of Significant DE Genes: This is influenced by both the magnitude of treatment effects (fold changes) and the dispersion. Higher dispersion (more within-group variability) typically leads to fewer DE genes because p-values are inflated (harder to achieve significance). Conversely, strong separation between groups often correlates with more DE genes, as it implies consistent, large fold changes across many genes.
Degree of Separation: This reflects between-group vs. within-group variance. If treatment causes strong, consistent shifts in expression, you'll see greater separation (e.g., in MDS/PCA). In breed A with greater dissimilarity than breed B:
- You might expect more sig DE genes in A, assuming the separation is driven by true biological differences (not noise).
- But dispersion could be lower in A if the within-group replicates are tight (low noise), enabling more DE detection. High dispersion in B could mask DE, leading to less separation and fewer DE genes.
- It's not a direct causal link—high dispersion can sometimes coincide with strong effects if the variability is biologically meaningful, but generally, low dispersion + high separation = more DE genes.
In short: Greater separation often aligns with more DE genes and potentially lower dispersion (better signal-to-noise). Check this in your data by comparing edgeR's common.dispersion from each breed's GLM fit against your DE counts and divergence metrics. If you fit a full model across all breeds (with interaction terms like ~ Breed * Status), you can even test for breed-specific dispersions using quasi-likelihood F-tests in edgeR.
If that doesn't match your data, there could be other factors like batch effects or outliers—always worth checking with boxplots of dispersions or QQ plots.
Let me know if you need code tweaks or more details on implementing this. I've handled similar multi-group RNA-seq designs in pharma projects, so feel free to share more specifics.
Kevin
Hello friend, I do love Euclidean Distance --my favourite distance metric--- but I may not have time to respond - my sincere apologies. Working 10+ hour days here.