Question

Why is DESeq2 normalization making my top feature have identical values across samples?

19 days ago

DNAngel ▴ 250

Hi all,

I'm using DESeq2 to normalize my counts dataset which has about 90 samples and 252 taxa. I will need this for WGCNA analyses which I have done many times before with species abundance datasets without issue.

However, this time I'm running into a strange problem that I can't explain. I have two datasets, bacteria and fungi, and normally I combine them into one complete microbiome dataset.

I have already identified top species (i.e., based on abundance, and their contribution scores).

What's strange is that when I normalize the datasets separately, the numbers look fine. But when I normalize the combined bacteria+fungi dataset, whether I use estimateSizeFactors followed by estimateDispersions and then nbinomWaldTest, or varianceStabilizingTransformation, the normalized count matrix ends up with my top species, a fungus, having identical values across all samples. Ultimately this means it gets removed during the cleaning steps of the WGCNA analysis.

Why is this happening?

Below are the two approaches I've used to normalize my data.

data_env <- data[, 1:5]    # environmental and sample info
data_sp <- data[, -(1:5)]  # taxa counts
data_sp.counts <- as.data.frame(t(data_sp)) # transpose so taxa are rows and samples are columns

# condition = the 10 different flower species sampled across 80 sites
data_env.coldata <- data.frame(rows = colnames(data_sp.counts),
                               condition = as.factor(data$Species))
data_env.coldata$rows <- as.character(data_env.coldata$rows)
data_env.coldata$condition <- as.factor(data_env.coldata$condition)

# Normalization method 1
dds <- DESeqDataSetFromMatrix(countData = data_sp.counts,
                              colData = data_env.coldata,
                              design = ~ condition)

# Note: DESeq() already runs the three steps below, so the explicit calls are redundant
dds <- DESeq(dds)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)

normalized_counts <- counts(dds, normalized = TRUE)


# Normalization method 2
dds <- DESeqDataSetFromMatrix(countData = data_sp.counts,
                              colData = data_env.coldata,
                              design = ~ condition)
vsd <- varianceStabilizingTransformation(dds, blind = FALSE)
mat <- assay(vsd)

In either case, the top fungus ends up with almost identical values across samples (they differ only in the trailing significant digits and are identical once rounded), so it gets removed by WGCNA.

DESEQ2 WGCNA • 333 views

ADD COMMENT • link 18 days ago by DNAngel ▴ 250

score 0 · Answer 1 · 2024-04-11

18 days ago

LChart 3.9k

This could be an edge case that happens when all or nearly all genes have at least one sample with a 0 count, which can distort the size factor estimates. What happens if you run estimateSizeFactors(dds, type='iterate')?
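
To see why zeros could produce exactly this symptom, here is a minimal base-R sketch of the median-of-ratios idea behind the default size factor estimate. It is an illustration on made-up toy counts, not DESeq2's own code: the function name `median_ratio_size_factors` and all taxon and sample names are hypothetical. The point it tries to show is that when only one taxon is free of zeros, the size factors are derived from that taxon alone, so its normalized counts collapse to a single value.

```r
# Sketch of median-of-ratios size factors (hypothetical helper, toy data).
median_ratio_size_factors <- function(counts) {
  log_geo_means <- rowMeans(log(counts))   # log(0) = -Inf, so any zero gives -Inf
  usable <- is.finite(log_geo_means)       # only zero-free taxa can be used
  if (!any(usable)) stop("every taxon contains at least one zero")
  apply(counts, 2, function(col) {
    exp(median(log(col[usable]) - log_geo_means[usable]))
  })
}

# Toy matrix: taxa are rows, samples are columns; only top_fungus has no zeros.
counts <- rbind(
  top_fungus  = c(500, 1200, 300, 800),
  bacterium_a = c(0, 40, 12, 7),
  bacterium_b = c(15, 0, 9, 22),
  bacterium_c = c(3, 8, 0, 0)
)
colnames(counts) <- paste0("sample", 1:4)

# Number of taxa without any zero (here: 1).
n_zero_free <- sum(rowSums(counts == 0) == 0)

sf <- median_ratio_size_factors(counts)
normalized <- sweep(counts, 2, sf, "/")    # divide each column by its size factor

# Every sample now carries the same value for top_fungus: its geometric mean.
print(normalized["top_fungus", ])
```

If a handful of taxa are zero-free instead of just one, the same mechanism would presumably give nearly (rather than exactly) identical values for the dominant taxon, which seems consistent with what the question describes.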

ADD COMMENT • link 18 days ago by LChart 3.9k

I got this error when I ran that:

Error in estimateSizeFactorsIterate(object) : 
  iterative size factor normalization did not converge

ADD REPLY • link 18 days ago by DNAngel ▴ 250

Actually, calling estimateSizeFactors(dds, type="poscounts") worked. I didn't realize this was an option; it seems to handle zero-inflated data and is suited to datasets where every gene/feature/taxon has a 0 in at least one sample.
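
For anyone hitting the same thing, a small sketch comparing the default size factors with type="poscounts" on simulated data. Everything here is an assumption for illustration: the toy matrix, the object names (`toy_counts`, `coldata`, `dds_ratio`, `dds_pos`), and the three made-up conditions stand in for the real taxa table and flower species.

```r
library(DESeq2)

set.seed(1)
n_taxa <- 50
n_samples <- 12

# Simulated zero-inflated counts: taxa are rows, samples are columns.
toy_counts <- matrix(rnbinom(n_taxa * n_samples, mu = 20, size = 0.5),
                     nrow = n_taxa,
                     dimnames = list(paste0("taxon", seq_len(n_taxa)),
                                     paste0("sample", seq_len(n_samples))))
# Make taxon1 abundant and strictly positive in every sample ...
toy_counts[1, ] <- rnbinom(n_samples, mu = 1000, size = 5) + 1L
# ... and force at least one zero into every other taxon.
zero_cols <- sample(n_samples, n_taxa - 1, replace = TRUE)
toy_counts[cbind(2:n_taxa, zero_cols)] <- 0L

coldata <- data.frame(condition = factor(rep(c("A", "B", "C"), each = 4)),
                      row.names = colnames(toy_counts))

dds <- DESeqDataSetFromMatrix(countData = toy_counts,
                              colData = coldata,
                              design = ~ condition)

# Taxa without any zero: the only ones the default method can use.
n_zero_free <- sum(rowSums(counts(dds) == 0) == 0)

# Default size factors: taxon1 collapses to one value across samples.
dds_ratio <- estimateSizeFactors(dds)
norm_ratio <- counts(dds_ratio, normalized = TRUE)

# "poscounts" size factors: taxon1 keeps its between-sample variation.
dds_pos <- estimateSizeFactors(dds, type = "poscounts")
norm_pos <- counts(dds_pos, normalized = TRUE)

print(range(norm_ratio["taxon1", ]))
print(range(norm_pos["taxon1", ]))
```

On the real data the equivalent would be to call estimateSizeFactors(dds, type="poscounts") on the combined dataset before varianceStabilizingTransformation; as far as I understand, the transformation reuses size factors already stored on the object rather than re-estimating them.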

ADD REPLY • link 18 days ago by DNAngel ▴ 250