Question

Correlated standard errors in DESeq2


9.2 years ago

lams ▴ 30

Dear community,

We are analyzing RNA-seq data from a small time-series experiment with three groups each sampled at a different timepoint (timepoint-0, timepoint-1, timepoint-2) - hence no measurements were repeated on the same individuals. The design is unbalanced with few samples in each group.

To identify genes that vary over time, we encoded the time variable using two dummy variables and applied the likelihood-ratio test (with no coefficient shrinkage). However, for the purpose of visualization, we wanted to show the estimated log2 fold-changes (LFCs) along with their standard errors (SEs) for a model fit using coefficient shrinkage. As the SEs of the LFC estimates of the most significant genes from our test looked quite correlated, we plotted the SEs of the two estimates against each other for all genes and got (red line: intercept 0, slope 1):

The corresponding LFCs (for completeness):

We couldn't come up with an explanation for the strong (and weirdly looking) SE correlation, so we would like to know:

Is this how it should be? (And out of curiosity, why is this so?)
Does it have any implications for the validity of our analysis?

Thanks!

DESeq2 RNA-Seq • 2.9k views

updated 24 months ago by Ram 43k • written 9.2 years ago by lams ▴ 30


How many samples are in each timepoint? When you say "few samples" in each group, is that one or five?

updated 24 months ago by Ram 43k • written 9.2 years ago by karl.stamm 4.1k

Ram · Accepted Answer · 2015-02-20

This is probably the normal behavior of the algorithm that estimates the lfcSE.

Consider the simple case of a binomial random variable: a coin toss with probability p of heads. The expected count of heads after n flips is n times p, and the variance around that result is n times p times (1-p). My point is that the variability is a simple function of the parameters p and n. DESeq2 is probably doing something similar in your experimental setup, where the lfcSE is derived from the model it fits to the data. The variability modeling steps in DESeq2 use the whole dataset and share information about the dispersion seen in each transcript, so you should expect some kind of relationship. In a more controlled experiment you'd probably see a less clear relationship, but here it's doing what it can with limited information.
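To make the coin-toss point concrete, here is a minimal simulation sketch. The values of n, p, and the number of trials are arbitrary choices for illustration, and it only checks the textbook binomial formulas; it says nothing about DESeq2's internals.

```python
import random
import statistics


def simulate_heads(n, p, trials, seed=0):
    """Return the number of heads in n flips, repeated `trials` times."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) for _ in range(trials)]


# Hypothetical parameters, chosen only for illustration.
n, p = 50, 0.3
counts = simulate_heads(n, p, trials=20000)

empirical_mean = statistics.fmean(counts)
empirical_var = statistics.pvariance(counts)

# Both quantities are fixed by n and p alone.
theoretical_mean = n * p
theoretical_var = n * p * (1 - p)

print(f"mean:     simulated {empirical_mean:.2f}, theory {theoretical_mean:.2f}")
print(f"variance: simulated {empirical_var:.2f}, theory {theoretical_var:.2f}")
```

The simulated mean and variance should land close to n times p and n times p times (1-p), so once you know the parameters you also know the spread.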

Since timepoint 1 has strictly lower SEs, it looks like that's the condition you have more replicates of. If you need to know how the numbers are determined, the source code is available to look into. I wouldn't worry about it.

Edit: Worth looking into my own data for the same phenomenon. When comparing two samples in three groups (six total samples) my lfcSE plot is on the diagonal red line. No crazy loop, but the very coherent correlation is present. I'll stand by my claim that whatever makes one gene have high SE in one comparison also influences it in the other comparison, particularly since we're using the same third comparison group (your timepoint zero). I'll look into another experiment with more than 2v2 samples to see if the loop structure comes out at all.
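A rough sketch of why a shared reference group ties the two SEs together, using an ordinary least-squares analogy rather than DESeq2's actual negative binomial model with shrinkage. The group sizes (3, 5, 2) and the per-gene noise levels are made-up assumptions. In this simplified setting each contrast's SE is the gene's residual standard deviation times a factor that depends only on the group sizes, so the two SEs are exactly proportional across genes.

```python
import math
import random


def contrast_ses(groups):
    """SEs of (group k mean - group 0 mean) for k >= 1, using the
    pooled residual variance of a one-way layout (plain OLS)."""
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    rss = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    sigma2 = rss / (n_total - len(groups))  # pooled residual variance
    n0 = len(groups[0])
    return [math.sqrt(sigma2 * (1 / n0 + 1 / len(g))) for g in groups[1:]]


rng = random.Random(1)
sizes = [3, 5, 2]  # hypothetical unbalanced design: timepoints 0, 1, 2

ratios = []
for _ in range(1000):  # 1000 simulated "genes"
    sd = rng.uniform(0.1, 2.0)  # gene-specific noise level (assumed)
    groups = [[rng.gauss(0.0, sd) for _ in range(n)] for n in sizes]
    se1, se2 = contrast_ses(groups)
    ratios.append(se1 / se2)

# The ratio depends only on the group sizes, not on the gene.
expected_ratio = math.sqrt(1 / 3 + 1 / 5) / math.sqrt(1 / 3 + 1 / 2)
print(min(ratios), max(ratios), expected_ratio)
```

Here the group with more replicates gets the smaller SE for every gene, and the SE-vs-SE plot would be a straight line through the origin. DESeq2's GLM adds mean-dependent variance and shrinkage on top of this, so a straight line is not guaranteed there and this sketch does not reproduce the loop shape.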

Edit2: Here are the results from two experiments. The top row is two samples per group, symmetric: Ax2 vs Bx2 and Ax2 vs Cx2. The bottom row is a varied dataset: Ax20 vs Bx6 and Bx6 vs Cx14. You'll see the correlation is a little fuzzier but still strongly determined by the gene itself. Of note, my SEs are almost 10 times as high as yours. Could you post some code so we can see exactly what analysis was performed?

Correlation Figures