I'm using DESeq2 to find differentially expressed genes between two conditions from RNA-seq data, with many replicates (46 in condition "1", 20 in condition "2").
I get results with significant adjusted p-values, but for most of these genes the expression values are highly variable between replicates.
For example, for the gene with the lowest adjusted p-value, all samples from both conditions have low normalized counts (around 10), except for a single sample in one condition with >200000 normalized counts, which drives the differential expression toward that condition.
See the log2(normalized counts + 1) boxplot below (for this gene, the adjusted p-value is 8.05e-12 and the log2FC is -5.87 between conditions "1" and "2").
Here is the code I used:
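    library(DESeq2)

    # Build the dataset from the tximport output, with condition as the only factor
    dds <- DESeqDataSetFromTximport(tx_import_data, coldata, ~condition)

    # Pre-filter: keep genes with at least 10 reads in total across all samples
    keep <- rowSums(counts(dds)) >= 10
    dds <- dds[keep,]

    # Set the reference level, then run the analysis
    dds$condition <- relevel(dds$condition, ref = "R")
    dds <- DESeq(dds)
    res05 <- results(dds, alpha=0.05)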
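To make the pattern described above concrete, here is a toy base-R sketch with made-up numbers (not the real data, and not what DESeq2 computes internally). It assumes 46 and 20 samples with counts around 10, plus one extreme value placed arbitrarily in condition "1", and compares a fold change based on group means with one based on group medians. All variable names are hypothetical.

```r
# Hypothetical normalized counts for a single gene: every sample is near 10.
set.seed(1)
condition <- factor(rep(c("1", "2"), times = c(46, 20)))
norm_counts <- rpois(length(condition), lambda = 10)

# One extreme sample, placed in condition "1" purely for illustration.
norm_counts[1] <- 200000

# Per-condition summaries: the mean is dominated by the single outlier,
# while the median barely moves.
group_means <- tapply(norm_counts, condition, mean)
group_medians <- tapply(norm_counts, condition, median)

# Naive log2 fold changes of condition "2" relative to condition "1".
lfc_mean <- log2(group_means[["2"]] / group_means[["1"]])
lfc_median <- log2(group_medians[["2"]] / group_medians[["1"]])

print(c(mean_based = lfc_mean, median_based = lfc_median))
```

In this toy case the mean-based fold change is large while the median-based one stays close to zero, which is the kind of discrepancy seen for the top gene.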
I'm wondering whether it is "normal" for DESeq2 to keep this kind of result, in which case I should filter such genes out myself if I find them irrelevant, or whether I made a mistake somewhere in the process and DESeq2 should only keep genes without such dispersion in expression between replicates.
Thanks for your help.