Question: DESeq2 unequal sample sizes
0
gravatar for glados
5 weeks ago by
glados20
glados20 wrote:

Hi,

I'm using DESeq2 to look for differentially expressed genes. I have three groups with different individuals: Treatment A, Treatment B, and Control.

Sample sizes (individuals) differ between the groups: A=9, B=5, Control=5.

If I want to test A directly against B, the sample size difference is of no problem, I understand that (asked already in this question: https://www.biostars.org/p/90421/).

But, when I want to test A vs Control and see how that differs from B vs Control, I worry about the unequal sample sizes. I expect the A vs Control comparison to find slightly more significant genes just because of more accurate dispersion.

How should I deal with this problem? Is 9 vs 5 not a big enough difference to worry about, or should I randomly subsample 5 individuals from treatment A (i.e. discard 4) and only use those in the comparison with the Controls?

Grateful for any advice.

deseq deseq2 R • 223 views
ADD COMMENTlink modified 4 weeks ago by Kevin Blighe3.4k • written 5 weeks ago by glados20
2
gravatar for Kevin Blighe
4 weeks ago by
Kevin Blighe3.4k
Republic of Ireland (Éire)
Kevin Blighe3.4k wrote:

Yes, when the numbers are unequal in this way, you will see many more genes passing your FDR Q value threshold. I faced this situation 2 years ago but for even more extreme/unequal sample distributions. I think that it's reasonable to use difference thresholds for each comparison.

Here is some code that allows you to visually check how FDR affects different comparisons by showing how many significant variables remain at each level of FDR - it also gives the nominal P-value equivalent at each FDR. in this way, you can choose a good cut-off.

options(scipen=999)

topT <- as.data.frame(AVsControl)
AVsControl.pvalues <- topT$pvalue[!is.na(topT$pvalue)]

#5% FDR
alpha <- 0.05
AVsControl.cutoff0.05 <- max(topT$pvalue[topT$padj<alpha], na.rm=TRUE)

#1% FDR
alpha <- 0.01
AVsControl.cutoff0.01 <- max(topT$pvalue[topT$padj<alpha], na.rm=TRUE)

et cetera...

pdf("RankPValue.pdf")
        par(mar=c(5,8,5,8), cex=0.8, cex.main=0.8, cex.lab=0.8, cex.axis=0.8, mfrow=c(1,2))

        #A versus Control
            orderInPlot <- order(AVsControl.pvalues)
            showInPlot <- (AVsControl.pvalues[orderInPlot] <= 0.05)
            alpha <- 0.05
            if (any(showInPlot))
            {
                plot(seq(along=which(showInPlot)), AVsControl.pvalues[orderInPlot][showInPlot], type="l", pch=".", main=paste("Significance cut-off @ 5% FDR = \n", round(AVsControl.cutoff0.05,9)), xlab=expression(paste("Rank ", italic("P"), " value (num. sig. genes)")), ylab=expression(paste(italic("P"), " value")))
                abline(a=0, b=alpha/length(AVsControl.pvalues), col="red3", lwd=1)
            }

            orderInPlot <- order(AVsControl.pvalues)
            showInPlot <- (AVsControl.pvalues[orderInPlot] <= 0.01)
            alpha <- 0.01
            if (any(showInPlot))
            {
                plot(seq(along=which(showInPlot)), AVsControl.pvalues[orderInPlot][showInPlot], type="l", pch=".", main=paste("Significance cut-off @ 1% FDR = \n", round(AVsControl.cutoff0.01,9)), xlab=expression(paste("Rank ", italic("P"), " value (num. sig. genes)")), ylab=expression(paste(italic("P"), " value")))
                abline(a=0, b=alpha/length(AVsControl.pvalues), col="red3", lwd=1)
            }
dev.off()

Screenshot

ADD COMMENTlink modified 12 days ago • written 4 weeks ago by Kevin Blighe3.4k

Kevin, thank you so much for your reply and code! Very appreciated. I did an analysis using only 5 subsampled individuals in groupA and I see about 40% fewer significant genes so it does indeed have a dramatic effect on the results.

I had not yet considered to use a stricter FDR threshold for groupA, though that makes a lot of sense. Thank you for suggesting it. That would allow me to keep all the individuals and the test will therefore become more accurate.

I did the plots based on your code, but have a slight problem understanding what they mean. Indeed, the red line cuts the curve at fewer genes for each lowered threshold (about 1500 genes at FDR 10%, 1000 at 5%, and about 500 at FDR 1%). But how does this inform me of what FDR to choose? Do I compare this somehow to the groupB results? Thank you for your advice.

ADD REPLYlink written 4 weeks ago by glados20

Hey glados,

Yes apologies for not elaborating further. These plots assisted me because, with them, I was able to quickly generate dozens of plots at different FDR thresholds and determine the number of genes passing FDR at each threshold. This was useful because we wanted a roughly even number of genes from each comparison for further downstream analyses.

The plots also help in a manuscript situation whereby, if you represent your different cut-offs visually, it provides just that little bit of extra assurance to reviewers that you have thought very deeply about the sample imbalance and how to best extract differentially expressed genes from your comparisons.

ADD REPLYlink written 4 weeks ago by Kevin Blighe3.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1437 users visited in the last hour