Question

Confusion regarding Euclidean distance measures of dissimilarity

0

Entering edit mode

3.5 years ago

mohsamir2016 ▴ 30

Dear All, and Kevin Blighe

I have an RNA seq data comparing treatment vs control groups in 4- breeds of animal.

dataset table screenshot

In each breed, I want to see how is the extent of the sepration between treatment vs control groups based on 18.000 genes being DE between the two groups. In another word, in which breed was the effect of treatment the greatest ?. Usually one visualize this using MDS plot, but I can not estimate this visually. In my MDS analyses (either the one that run from inside edgeR or the one run on the CPM TMM normalized counts) I could see separation between treated vs control groups in two breeds. The question is: How can I estimate numerically the divergence of the two groups in each breed so that I can tell the difference in treatment effect in among breeds. I have actually tried calculating the Euclidean distance among control and treated individuals within each breed, but it gave a distance number between each individual within each group in a pairwise manner, so what is the best method to show an overall dissimilarity between control and treatment given the distance between each pairs of individuals of the groups. I also thought about correlation coefficient, but it gave strange results.

I have another question: Are there any relation between the common dispersion (estimated during the DE analyses in EdgeR) among genes in one breed and the number of sig DE genes identified in this breed and the degree of separation between the two groups in the same breed ? in another word, Is this right that if for instance I found that in breed A the two groups have greater dissimilarity than in breed B, Should I expect higher dispersion for the genes or higher number of sig DE genes in breed A than in B?

Any comment would be helpful

RNAseq R EdgeR • 752 views

ADD COMMENT • link updated 7 hours ago by Kevin Blighe 89k • written 3.5 years ago by mohsamir2016 ▴ 30

0

Entering edit mode

Hello friend, I do love Euclidean Distance --my favourite distance metric--- but I may not have time to respond - my sincere apologies. Working 10+ hour days here.

ADD REPLY • link 3.5 years ago by Kevin Blighe 89k

score 0 · Answer 1 · 2025-11-07

Finally answered 3.5 years later now that I have time....

Hey,

I appreciate you reaching out directly - happy to try to help with this. I'll address your two main questions based on my experience with RNA-seq analyses in edgeR (and DESeq2, for that matter). I'll focus on practical suggestions, including some R code where it makes sense, as that's usually the easiest way to implement these ideas.

Quantifying Divergence / Separation Between Treatment and Control Groups in Each Breed

You're right that MDS (multidimensional scaling) plots are great for visualizing separation, but they're not ideal for numerical quantification, especially when you want a single metric to compare across breeds. The edgeR plotMDS() function uses a form of classical MDS based on leading log-fold changes or distances, which is why it's handy for quick checks, but for more robust quantification, I strongly recommend switching to principal component analysis (PCA). PCA is essentially a cousin of MDS but gives you more interpretable outputs, like eigenvalues (variance explained per component) and loadings (which genes drive the separation). I've developed the PCAtools package specifically for this kind of in-depth exploration on high-dimensional data like RNA-seq counts.

Here's why PCA is better here:

It directly models variance in your data, so you can quantify how much of the total variation is attributable to treatment vs. other factors (e.g., sex, which I see in your design table).
You can compute group-level metrics like the distance between centroids (means) of the treatment and control groups in PCA space.
It handles the high dimensionality of your ~18k genes well, especially after filtering low-expressed genes.

Step-by-Step Approach

Prepare Your Data: Start with your TMM-normalized CPM values (as you mentioned). Make sure to log-transform them (e.g., logCPM) to approximate a Gaussian distribution, which is better for PCA. Filter out lowly expressed genes if you haven't already (e.g., keep genes with CPM > 1 in at least 3 samples).

Assuming you have your edgeR DGEList object y after normalization:
```
library(edgeR)
logcpm <- cpm(y, log = TRUE, normalized.lib.sizes = TRUE)
# Subset to your ~18k DE genes if that's what you want to focus on
logcpm_de <- logcpm[de_genes, ]  # de_genes is a vector of your 18k gene IDs
```

Run PCA Per Breed: Since you want to compare within each breed, subset your data by breed and run PCA separately for each. This isolates the treatment effect without confounding from breed differences.

# Assume 'samples' is your data.frame with columns: Individual, Status, Breed, Sex
breeds <- unique(samples$Breed)  # e.g., c("Breed 1", "Breed 2", "Breed 3", "Breed 4")

divergence_metrics <- list()

for (b in breeds) {
  # Subset samples and logCPM for this breed
  breed_idx <- samples$Breed == b
  breed_logcpm <- logcpm_de[, breed_idx]
  breed_samples <- samples[breed_idx, ]

  # Run PCA (center but don't scale if already logged; scaling can be added if needed)
  pca <- prcomp(t(breed_logcpm), center = TRUE, scale. = FALSE)

  # Get scores (sample coordinates in PCA space)
  scores <- as.data.frame(pca$x)

  # Add group labels
  scores$Status <- breed_samples$Status

  # Compute centroid (mean) for each group along the first few PCs (e.g., PC1-PC5, where most variance is)
  centroid_treated <- colMeans(scores[scores$Status == "Treated", 1:5])
  centroid_control <- colMeans(scores[scores$Status == "Control", 1:5])

  # Euclidean distance between centroids as a divergence metric
  dist_centroids <- dist(rbind(centroid_treated, centroid_control))

  # Alternatively, Mahalanobis distance (accounts for covariance structure)
  library(stats)
  mahal_dist <- mahalanobis(centroid_treated, centroid_control, cov(scores[, 1:5]))

  # Or, variance explained by treatment: use ANOVA on PC1 scores
  anova_pc1 <- anova(lm(PC1 ~ Status, data = scores))$`Pr(>F)`[1]  # p-value for separation on PC1
  var_explained <- summary(lm(PC1 ~ Status, data = scores))$r.squared  # R^2 as proportion of variance due to treatment

  # Store results
  divergence_metrics[[b]] <- list(
    euclidean_dist = dist_centroids,
    mahalanobis_dist = mahal_dist,
    pc1_pval = anova_pc1,
    treatment_var_explained = var_explained
  )
}

# Compare across breeds
divergence_metrics

Euclidean Distance Between Centroids: This gives a simple, overall dissimilarity score between the two groups. It's better than raw pairwise distances because it summarizes at the group level. Higher distance = greater separation = stronger treatment effect.
Mahalanobis Distance: This is more sophisticated as it accounts for the data's covariance (i.e., correlations between genes/PCs), making it robust to anisotropic distributions.
Variance Explained by Treatment: Using R² from a linear model on PC1 (or other PCs) tells you what fraction of the main axis of variation is due to treatment. If PC1 separates treatment vs. control well, a high R² indicates strong effect.
You could also use PERMANOVA (from the vegan package) on the full distance matrix for a statistical test of group differences: adonis(dist(t(breed_logcpm)) ~ Status, data = breed_samples).

Avoid correlation coefficients for this—they're more for pairwise sample similarity and can indeed give odd results in high dimensions without transformation.

If you use my PCAtools package, you can get even more: biplots with loadings, scree plots, and horn's parallel analysis to decide how many PCs to keep. Install via Bioconductor: BiocManager::install("PCAtools").

Interpreting Across Breeds: The breed with the highest distance or R² has the greatest treatment effect. From your design table, breeds look balanced (~5-6 samples/group), which is good—avoids bias from unequal sizes. If sex is a confounder, add it to the model (e.g., ~ Status + Sex).

This should give you a numerical way to rank the breeds by treatment effect divergence.

Relation Between Common Dispersion, Number of Significant DE Genes, and Group Separation

Yes, there can be relationships here, but they're not always straightforward— it depends on the data. Let's break it down:

Common Dispersion in edgeR: This is the estimated average biological variability (as the coefficient of variation squared) across genes after fitting the model. It's a measure of within-group noise/variation. Lower common dispersion means tighter replicates within groups, making it easier to detect DE (since signal-to-noise is higher).
Number of Significant DE Genes: This is influenced by both the magnitude of treatment effects (fold changes) and the dispersion. Higher dispersion (more within-group variability) typically leads to fewer DE genes because p-values are inflated (harder to achieve significance). Conversely, strong separation between groups often correlates with more DE genes, as it implies consistent, large fold changes across many genes.
Degree of Separation: This reflects between-group vs. within-group variance. If treatment causes strong, consistent shifts in expression, you'll see greater separation (e.g., in MDS/PCA). In breed A with greater dissimilarity than breed B:
- You might expect more sig DE genes in A, assuming the separation is driven by true biological differences (not noise).
- But dispersion could be lower in A if the within-group replicates are tight (low noise), enabling more DE detection. High dispersion in B could mask DE, leading to less separation and fewer DE genes.
- It's not a direct causal link—high dispersion can sometimes coincide with strong effects if the variability is biologically meaningful, but generally, low dispersion + high separation = more DE genes.

In short: Greater separation often aligns with more DE genes and potentially lower dispersion (better signal-to-noise). Check this in your data by comparing edgeR's common.dispersion from each breed's GLM fit against your DE counts and divergence metrics. If you fit a full model across all breeds (with interaction terms like ~ Breed * Status), you can even test for breed-specific dispersions using quasi-likelihood F-tests in edgeR.

If that doesn't match your data, there could be other factors like batch effects or outliers—always worth checking with boxplots of dispersions or QQ plots.

Let me know if you need code tweaks or more details on implementing this. I've handled similar multi-group RNA-seq designs in pharma projects, so feel free to share more specifics.

Kevin