Question: Correlated standard errors in DESeq2
lams30 wrote:

Dear community,

We are analyzing RNA-seq data from a small time-series experiment with three groups, each sampled at a different timepoint (timepoint-0, timepoint-1, timepoint-2); hence no measurements were repeated on the same individuals. The design is unbalanced, with few samples in each group.

To identify genes that vary over time, we encoded the time variable using two dummy variables and applied the likelihood-ratio test (with no coefficient shrinkage). For visualization, however, we wanted to show the estimated log2 fold changes (LFCs) along with their standard errors (SEs) from a model fit with coefficient shrinkage. Since the SEs of the LFC estimates of the most significant genes from our test looked quite correlated, we plotted the SEs of the two estimates against each other for all genes:

[plot of the two SE vectors against each other; red line: intercept 0, slope 1]

The corresponding LFCs, for completeness:

[plot of the two LFC vectors against each other]

We couldn't come up with an explanation for the strong (and weird-looking) SE correlation, so we would like to know:

• Is this how it should be? (And out of curiosity, why is this so?)
• Does it have any implications for the validity of our analysis?
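For reference, the test described above corresponds roughly to the following (a sketch; `dds` and the `timepoint` column are placeholders for our actual objects):

```r
# Sketch of the analysis described in the question; object and
# column names (dds, timepoint) are placeholders, not our exact code.
dds$timepoint <- factor(dds$timepoint, levels = c("0", "1", "2"))
design(dds) <- ~ timepoint                       # encoded as two dummy variables
dds <- DESeq(dds, test = "LRT", reduced = ~ 1)   # genes varying over time
res <- results(dds)                              # unshrunken LFCs and SEs
```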

Thanks!

rna-seq deseq2
modified 5.4 years ago • written 5.4 years ago by lams30

How many samples are in each timepoint? When you say "few samples" in each group, is that one or five?

karl.stamm wrote:

This is probably the normal behavior of the algorithm estimating the lfcSE.

Consider the simple case of a binomial random variable: a coin toss with probability p of heads. The expected count of heads after n flips is n times p, and the variance around that expectation is n times p times (1-p). My point is that the variability is a simple function of the parameters p and n. DESeq is probably doing something similar in your experimental setup, where the lfcSE is derived from the data modeling it performs. The dispersion-estimation steps in DESeq use the whole dataset and share information across transcripts, so you should expect some kind of relationship. In a more controlled experiment you would probably see a less clear relationship, but here it is doing what it can with limited information.
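To make the point concrete (a toy sketch, not DESeq2's actual model):

```r
# Toy illustration: for a Binomial(n, p), the expected count and
# its variance are simple functions of the parameters alone.
n <- 100
p <- 0.5
expected_heads <- n * p             # 50
var_heads      <- n * p * (1 - p)   # 25
se_heads       <- sqrt(var_heads)   # 5
```

Once you know n and p, the spread comes for free; similarly, once DESeq2 has fit a mean and a dispersion for a gene, the lfcSE follows from them.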

Since timepoint 1 has strictly lower SEs, it looks like that's the condition with more replicates. If you need to know exactly how the numbers are determined, the source code is available to look into. I wouldn't worry about it.

Edit: Worth checking my own data for the same phenomenon. Comparing two samples in three groups (six samples total), my lfcSE plot lies on the diagonal red line. No crazy loop, but the very coherent correlation is present. I'll stand by my claim that whatever makes a gene have a high SE in one comparison also influences it in the other, particularly since both comparisons use the same third group (your timepoint 0). I'll look into another experiment with more than 2 vs 2 samples to see whether the loop structure appears at all.

Edit 2: Here are the results on two experiments. The top row is two samples per group, symmetric: Ax2 vs Bx2 and Ax2 vs Cx2. The bottom row is a more varied dataset: Ax20 vs Bx6 and Bx6 vs Cx14. You'll see the correlation is a little fuzzier but still strongly determined by the gene itself. Of note, my SEs are almost 10 times as high as yours. Could you post some code so we can see exactly what analysis was performed?

Karl is right. When you have very few samples (few residual degrees of freedom), there is little information to estimate the dispersion, so the final dispersion estimates are often close to the parametric line, which is a function of the mean of normalized counts across all samples. The standard errors are calculated using the dispersions, so this dependence on the mean will also be seen in the standard errors. Because the groups are unbalanced, one of the groups contributes more to the mean of normalized counts than the others.

Also, at very small mean counts, the LFCs and standard errors start to look like a function of the mean, because the LFC shrinkage increases at low counts. If you color the SE plot with something like `col=ifelse(res$baseMean > 5, "black", "red")`, you will see that the curve of points is just genes with very small counts; the curve is an artifact of the properties mentioned above. These genes have very little power for detecting DE, so it's not a problem for inference.
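Something along these lines (a sketch, assuming `res` is your results object and `se1`/`se2` are the two lfcSE vectors you plotted; adjust the names to your code):

```r
# Color the SE-vs-SE scatter by baseMean: the "loop" should turn
# out to be the low-count genes (red). se1/se2 are placeholders
# for the two lfcSE vectors from the plot above.
plot(se1, se2,
     col = ifelse(res$baseMean > 5, "black", "red"),
     xlab = "lfcSE, first coefficient",
     ylab = "lfcSE, second coefficient")
abline(0, 1, col = "red")   # intercept 0, slope 1
```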

Although with a really unbalanced design, it might make sense not to use a parametric function of the mean of normalized counts from all samples (the default, fitType="parametric"), but instead to use a common value across all genes to set the dispersion prior:

`DESeq(dds, fitType="mean")`

Then the dispersion estimation is not affected by the mean of normalized counts.
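A quick way to inspect the effect of this choice (continuing from an existing `dds`; `plotDispEsts` shows the gene-wise estimates against the fitted value):

```r
# Refit with a flat dispersion prior (a common value across genes)
# instead of the default trend over the mean of normalized counts.
dds <- DESeq(dds, fitType = "mean")
plotDispEsts(dds)   # the fitted line should now be flat
```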