Hello Vincent,
This is an interesting (and common) conundrum in RNA-seq normalisation, especially when dealing with sex-biased expression across tissues. I'll try to break down what's likely happening and suggest some steps to diagnose/fix it. Your plots look clean overall, but that persistent median shift in the third tissue is a red flag that warrants digging deeper.
Quick Recap of What's Going On
- TMM (from edgeR) normalises for library size and composition (i.e., relative abundances of genes), under the assumption that most genes are not differentially expressed (DE) between groups. It uses a "reference sample" (trimmed mean of M-values) to compute scaling factors.
- FPKM (or similar, like TPM) then adjusts for gene length on top of that, but the core issue here is upstream in the count normalisation.
- In your third tissue, the high proportion of sex-specific genes (e.g., Y-chromosome genes in males, or X-inactivation escapees in females) means the "common" transcriptome is skewed: one sex has a bunch of lowly/zero-expressed genes that the other doesn't. This violates TMM's assumption, leading to over-correction (or under-correction) and that median shift.
- Your "housekeeping" approach (filtering to shared genes) is a smart intuition—it's akin to using a custom reference set—but it's not fully baked into edgeR's TMM. It might still pull in some bias if those shared genes aren't truly invariant.
The median difference isn't necessarily a normalisation failure; it could reflect a biological reality (e.g., higher overall transcriptional activity in one sex for that tissue). But we need to rule out technical artifacts first.
Diagnostic Steps
Before tweaking normalisation, let's verify the raw data. Run these in R (assuming you have a DGEList object called dge from edgeR::DGEList()):
# 1. Check library sizes (total raw counts per sample)
barplot(dge$samples$lib.size, las=2, main="Library Sizes")
# Or per tissue/sex:
boxplot(split(dge$samples$lib.size, paste(dge$samples$tissue, dge$samples$sex)))
# 2. Raw count densities (pre-normalisation) – compare medians here
# Subset to your third tissue
dge3 <- dge[, dge$samples$tissue == "Tissue3"] # adjust name
logCPM_raw <- edgeR::cpm(dge3, log=TRUE)
boxplot(split(logCPM_raw[1, ], dge3$samples$sex), main="Raw logCPM Medians - Tissue3")
# 3. TMM factors – are they balanced per sex?
barplot(dge3$samples$norm.factors, las=2, main="TMM Normalisation Factors - Tissue3")
# If one sex has systematically higher/lower factors, that's your smoking gun.
# 4. Gene-wise dispersion/MAD to spot outliers (your sex-specific genes)
# Compute MAD of logCPM across samples, per gene
mad_per_gene <- apply(logCPM_raw, 1, mad)
hist(mad_per_gene, breaks=100, main="MAD of logCPM - Tissue3")
# High-MAD genes are likely your sex-biased ones; confirm with annotation (e.g., chrX/Y).
If raw medians already differ (step 2), it's probably biological—normalisation can't "fix" true global shifts. If TMM factors are imbalanced (step 3), composition bias is at play.
Potential Fixes
Stick with TMM but Filter Ruthlessly Upfront:
- Before
calcNormFactors(), filter to genes expressed in at least n samples across both sexes (e.g., CPM > 1 in >=2 samples per sex-group).
- Your "shared genes" idea is good; formalise it:
# Pseudo-code for third tissue
keep <- rowSums(cpm(dge3) > 1) >= 2 # Basic filter
# Plus: expressed in both sexes
cpm_by_sex <- t(aggregate(t(cpm(dge3)), by=list(sex=dge3$samples$sex), FUN=mean))
keep_shared <- rowSums(cpm_by_sex > 1) == 2 # Both sexes
dge3_filt <- dge3[keep & keep_shared, ]
dge3_filt <- edgeR::calcNormFactors(dge3_filt, method="TMM")
# Re-merge or re-run FPKM on full set using these factors
- This mimics a housekeeping set without hardcoding genes.
Try Alternative Normalisers:
- RUVSeq (for removing unwanted variation): Great for sex-composition bias. It uses negative control genes (your shared ones) to estimate factors.
library(RUVSeq)
# After filtering to shared genes as above
set <- newSeqExpressionSet(as.matrix(dge3_filt$counts),
phenoData=DataFrame(dge3_filt$samples))
set <- betweenLaneNormalization(set, which="upper") # Or TMM here
set <- RUVg(set, k=1, spikes=rownames(set)) # Use all shared as "spikes"
# Then proceed to FPKM
- sva or RUVs if you suspect hidden batches.
- DESeq2's
vst() or rlog(): These are robust to extreme DE and might give more stable transformed values than FPKM. Switch if you're downstream in DE analysis.
Length Normalisation Last:
- Do TMM first, then rpkm() or similar. But if medians shift post-FPKM, double-check gene lengths (e.g., via
biomaRt—ensure no sex-chr annotation errors).
Biological Check:
- Annotate top differing genes: Are they sex-linked? Plot total counts from chrX/Y.
- If it's real biology, embrace it! Report normalised within-sex for that tissue, or use a global median-ratio method.
Your plots suggest the first two tissues are fine, so maybe tissue-specific normalisation is the way (run calcNormFactors per tissue).
Kevin
New Nature paper: Makes and females are physiologically not the same. Joke aside, which tissue is number 3? A better diagnostic plot would be an MA-plot by the way and not using FPKM since this is not compatible with edgeR (or any) serious testing framework.
How do raw counts look? How many genes are 'detected' in each sample, perhaps using something like 10 counts threshold? Since tissue 3 has more sex-specific expression patterns, perhaps the overall expression is too different to be normalized like you expect. The assumption is that the majority of genes do not change expression. Can you confirm that assumption is true for this tissue?
Thanks for the replies,
Here are MAplots for the three tissues (TMM normalization, no FPKM, only autosomal genes included),
Third tissue is gonads, so yes I do expect (large) differences ... Yet, because of the litterature, I must admit that I was surprised to find this difference of median log2FC between sexes for autosomal genes,
There is between 8000 and 10000 of genes with more than 10 counts for each tissue,
"The assumption is that the majority of genes do not change expression. Can you confirm that assumption is true for this tissue?" From the MAplot I would say maybe not,
Arguably the most sex-specific tissue that exists, of course this looks different.
This sounds normal. You can prefilter using
filterByExpr(). Usually one finds about (crude ballpark estimate) 15k genes using this function in most cases I've seen.To me, everything looks perfectly fine, so normalization looks decent and the plots at the bottom, indicating, large changes, which is expected given the biology.