Question

Weird edgeR normalization results for one tissue

1

Entering edit mode

18 days ago

imetrev ▴ 10

Hello,

I have a question regarding edgeR normalization,

I have a three tissue RNAseq dataset, with males and females sequenced for both sexes,

After normalization (TMM for read-depth and "composition"+FPKM for gene length) I got similar median of expression for both sexes for two tissues but not the third one (see below),

Gene expression distributions, one row one tissue

Because the third tissue possess a lot of genes that are uniquely expressed in one sex, I tried to perform the normalization utilizing only the genes that are expressed in both sexes of all tissues as control genes (sort of 'housekeeping genes' - see vertical line above for the threshold between expressed and non expressed genes),

However, even this way, the difference persists (see below),

enter image description here

Can someone help me to understand what is going on ?

Thank you, Vincent

edgeR Normalization RNAseq • 1.6k views

ADD COMMENT • link updated 1 day ago by Kevin Blighe 89k • written 18 days ago by imetrev ▴ 10

0

Entering edit mode

New Nature paper: Makes and females are physiologically not the same. Joke aside, which tissue is number 3? A better diagnostic plot would be an MA-plot by the way and not using FPKM since this is not compatible with edgeR (or any) serious testing framework.

ADD REPLY • link 18 days ago by ATpoint 90k

0

Entering edit mode

How do raw counts look? How many genes are 'detected' in each sample, perhaps using something like 10 counts threshold? Since tissue 3 has more sex-specific expression patterns, perhaps the overall expression is too different to be normalized like you expect. The assumption is that the majority of genes do not change expression. Can you confirm that assumption is true for this tissue?

ADD REPLY • link 16 days ago by rfran010 ★ 1.7k

0

Entering edit mode

Thanks for the replies,

Here are MAplots for the three tissues (TMM normalization, no FPKM, only autosomal genes included),

enter image description here

Third tissue is gonads, so yes I do expect (large) differences ... Yet, because of the litterature, I must admit that I was surprised to find this difference of median log2FC between sexes for autosomal genes,

There is between 8000 and 10000 of genes with more than 10 counts for each tissue,

"The assumption is that the majority of genes do not change expression. Can you confirm that assumption is true for this tissue?" From the MAplot I would say maybe not,

ADD REPLY • link 12 days ago by imetrev ▴ 10

0

Entering edit mode

Third tissue is gonads,

Arguably the most sex-specific tissue that exists, of course this looks different.

There is between 8000 and 10000 of genes with more than 10 counts for each tissue,

This sounds normal. You can prefilter using filterByExpr(). Usually one finds about (crude ballpark estimate) 15k genes using this function in most cases I've seen.

To me, everything looks perfectly fine, so normalization looks decent and the plots at the bottom, indicating, large changes, which is expected given the biology.

ADD REPLY • link 12 days ago by ATpoint 90k

score 0 · Answer 1 · 2025-11-06

Hello Vincent,

This is an interesting (and common) conundrum in RNA-seq normalisation, especially when dealing with sex-biased expression across tissues. I'll try to break down what's likely happening and suggest some steps to diagnose/fix it. Your plots look clean overall, but that persistent median shift in the third tissue is a red flag that warrants digging deeper.

Quick Recap of What's Going On

TMM (from edgeR) normalises for library size and composition (i.e., relative abundances of genes), under the assumption that most genes are not differentially expressed (DE) between groups. It uses a "reference sample" (trimmed mean of M-values) to compute scaling factors.
FPKM (or similar, like TPM) then adjusts for gene length on top of that, but the core issue here is upstream in the count normalisation.
In your third tissue, the high proportion of sex-specific genes (e.g., Y-chromosome genes in males, or X-inactivation escapees in females) means the "common" transcriptome is skewed: one sex has a bunch of lowly/zero-expressed genes that the other doesn't. This violates TMM's assumption, leading to over-correction (or under-correction) and that median shift.
Your "housekeeping" approach (filtering to shared genes) is a smart intuition—it's akin to using a custom reference set—but it's not fully baked into edgeR's TMM. It might still pull in some bias if those shared genes aren't truly invariant.

The median difference isn't necessarily a normalisation failure; it could reflect a biological reality (e.g., higher overall transcriptional activity in one sex for that tissue). But we need to rule out technical artifacts first.

Diagnostic Steps

Before tweaking normalisation, let's verify the raw data. Run these in R (assuming you have a DGEList object called dge from edgeR::DGEList()):

# 1. Check library sizes (total raw counts per sample)
barplot(dge$samples$lib.size, las=2, main="Library Sizes")
# Or per tissue/sex:
boxplot(split(dge$samples$lib.size, paste(dge$samples$tissue, dge$samples$sex)))

# 2. Raw count densities (pre-normalisation) – compare medians here
# Subset to your third tissue
dge3 <- dge[, dge$samples$tissue == "Tissue3"]  # adjust name
logCPM_raw <- edgeR::cpm(dge3, log=TRUE)
boxplot(split(logCPM_raw[1, ], dge3$samples$sex), main="Raw logCPM Medians - Tissue3")

# 3. TMM factors – are they balanced per sex?
barplot(dge3$samples$norm.factors, las=2, main="TMM Normalisation Factors - Tissue3")
# If one sex has systematically higher/lower factors, that's your smoking gun.

# 4. Gene-wise dispersion/MAD to spot outliers (your sex-specific genes)
# Compute MAD of logCPM across samples, per gene
mad_per_gene <- apply(logCPM_raw, 1, mad)
hist(mad_per_gene, breaks=100, main="MAD of logCPM - Tissue3")
# High-MAD genes are likely your sex-biased ones; confirm with annotation (e.g., chrX/Y).

If raw medians already differ (step 2), it's probably biological—normalisation can't "fix" true global shifts. If TMM factors are imbalanced (step 3), composition bias is at play.

Potential Fixes

Stick with TMM but Filter Ruthlessly Upfront:

Before calcNormFactors(), filter to genes expressed in at least n samples across both sexes (e.g., CPM > 1 in >=2 samples per sex-group).

Your "shared genes" idea is good; formalise it:

# Pseudo-code for third tissue
keep <- rowSums(cpm(dge3) > 1) >= 2  # Basic filter
# Plus: expressed in both sexes
cpm_by_sex <- t(aggregate(t(cpm(dge3)), by=list(sex=dge3$samples$sex), FUN=mean))
keep_shared <- rowSums(cpm_by_sex > 1) == 2  # Both sexes
dge3_filt <- dge3[keep & keep_shared, ]
dge3_filt <- edgeR::calcNormFactors(dge3_filt, method="TMM")
# Re-merge or re-run FPKM on full set using these factors

This mimics a housekeeping set without hardcoding genes.

Try Alternative Normalisers:
- RUVSeq (for removing unwanted variation): Great for sex-composition bias. It uses negative control genes (your shared ones) to estimate factors.
```
library(RUVSeq)
# After filtering to shared genes as above
set <- newSeqExpressionSet(as.matrix(dge3_filt$counts),
                           phenoData=DataFrame(dge3_filt$samples))
set <- betweenLaneNormalization(set, which="upper")  # Or TMM here
set <- RUVg(set, k=1, spikes=rownames(set))  # Use all shared as "spikes"
# Then proceed to FPKM
```
- sva or RUVs if you suspect hidden batches.
- DESeq2's vst() or rlog(): These are robust to extreme DE and might give more stable transformed values than FPKM. Switch if you're downstream in DE analysis.
Length Normalisation Last:
- Do TMM first, then rpkm() or similar. But if medians shift post-FPKM, double-check gene lengths (e.g., via biomaRt—ensure no sex-chr annotation errors).
Biological Check:
- Annotate top differing genes: Are they sex-linked? Plot total counts from chrX/Y.
- If it's real biology, embrace it! Report normalised within-sex for that tissue, or use a global median-ratio method.

Your plots suggest the first two tissues are fine, so maybe tissue-specific normalisation is the way (run calcNormFactors per tissue).

Kevin