Question

High log fold change but padj > 0.1 in DESeq2

4.0 years ago

thjnant ▴ 160

Hello,

I am analysing RNA-seq data to investigate differential gene expression in hybrids compared to parental species. Since I work with natural populations, I have few samples (5 of two different tissues for each species and hybrid).

I am using the DESeq2 package for my expression analysis. I observe that many genes have a padj > 0.1 even though they have a high log fold change (more than 1 or even 1.5). While this is true in one group, in another group genes with a log fold change of 1 or even lower have a padj < 0.1 or even padj < 0.05.

I was wondering what the reasons for this observation are.

Thank you in advance.

Deseq2 expression RNA-Seq R • 1.9k views

updated 4.0 years ago by ATpoint • written 4.0 years ago by thjnant

If you plot the normalized expression levels of those genes for each condition, you might understand why.

4.0 years ago by GouthamAtla

score 5 · Accepted Answer · 2020-04-30

Fold changes tend to be higher when genes have low overall expression (that is, low counts). Since tests on low counts have less statistical power than tests on high counts, these fold changes often do not reach significance unless they are supported by many replicates.

Example 1: A gene has an expression of 50 in one condition and 5 in the other. That is a fold change of 10.

Example 2: A gene has an expression of 5000 in one condition and 500 in the other. That is a fold change of 10 as well.

Still, the second one is much more reliable, as the first could be a product of the technical noise of sequencing. Adding or removing a few counts in example 1 can change the result considerably:

(50 - 10) vs (5 + 5) would already change the original FC from 10 to 4, whereas

(5000 - 10) vs (500 + 5) changes the FC from 10 to 9.88.
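The perturbation arithmetic can be checked directly. This is a minimal sketch in plain Python that recomputes both ratios after removing 10 counts from the larger value and adding 5 to the smaller one. The helper name `fold_change` is just an illustrative choice.

```python
def fold_change(a, b):
    """Plain ratio of two counts."""
    return a / b

# (higher count, lower count) for the low- and high-expression examples
examples = {"low counts": (50, 5), "high counts": (5000, 500)}

results = {}
for label, (a, b) in examples.items():
    before = fold_change(a, b)
    # same small perturbation in both cases: -10 on one side, +5 on the other
    after = fold_change(a - 10, b + 5)
    results[label] = (before, after)
    print(f"{label}: FC {before:.2f} -> {after:.2f}")
```

The same absolute change in counts moves the low-count ratio a lot and the high-count ratio hardly at all.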

You can see that higher counts are less affected by small fluctuations in counts, therefore they are more reliable. In DESeq2 you can check the baseMean column to get the average expression. This is probably low for many of these genes with high FCs but large padj. You can visualize this relationship of baseMean to logFC with the plotMA function.
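To put a rough number on "more reliable", here is a back-of-the-envelope sketch. It assumes negative-binomial counts with variance mu + dispersion * mu^2 and uses the delta-method approximation Var(ln mean) ≈ (1/mu + dispersion) / n. The dispersion of 0.1 and the 5 replicates per group are illustrative assumptions, and `approx_se_log2fc` is a hypothetical helper. DESeq2 itself fits a GLM and does not compute its standard errors this way.

```python
import math

def approx_se_log2fc(mu_a, mu_b, n_per_group, dispersion):
    """Approximate standard error of log2(mu_a / mu_b).

    Delta-method approximation under an assumed negative-binomial model;
    this is NOT DESeq2's estimator, only an illustration.
    """
    var_ln = (1 / mu_a + dispersion) / n_per_group \
           + (1 / mu_b + dispersion) / n_per_group
    return math.sqrt(var_ln) / math.log(2)  # convert from ln to log2 units

n, alpha = 5, 0.1  # assumed replicates per group and dispersion

lfc = math.log2(50 / 5)  # identical to log2(5000 / 500)
se_low = approx_se_log2fc(50, 5, n, alpha)
se_high = approx_se_log2fc(5000, 500, n, alpha)

print(f"log2FC = {lfc:.2f}")
print(f"approx. SE at 50 vs 5:     {se_low:.3f}")
print(f"approx. SE at 5000 vs 500: {se_high:.3f}")
```

Under these assumptions the 1/mu term dominates at low counts, so the same log2 fold change comes with a larger standard error and therefore a weaker test statistic.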

This is where the concept of shrinkage kicks in. It aims to estimate the "true" fold changes from the data. As the plots below show, there is little evidence that the fold changes of genes with low baseMean are real, so they are shrunk towards zero. If you want lowly expressed genes to be significant (given they are in fact DEGs), then most importantly you need many replicates and high sequencing depth.
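The idea behind shrinkage can be illustrated with a toy calculation. This sketch assumes a zero-centred normal prior on the log2 fold change and uses the textbook normal-normal posterior mean. It is not the estimator DESeq2 uses (see the vignette for that). The prior standard deviation and the two standard errors are made-up illustrative values, and `shrink_lfc` is a hypothetical helper.

```python
def shrink_lfc(lfc, se, prior_sd):
    """Posterior mean of the LFC under a zero-centred normal prior.

    Toy normal-normal model for illustration only: the noisier the
    estimate (large se), the harder it is pulled towards zero.
    """
    weight = prior_sd**2 / (prior_sd**2 + se**2)
    return weight * lfc

observed_lfc = 3.32   # roughly log2(10), same for both genes
prior_sd = 1.0        # assumed prior width

# illustrative standard errors: noisy low-baseMean gene vs precise high-baseMean gene
shrunk_low = shrink_lfc(observed_lfc, se=1.5, prior_sd=prior_sd)
shrunk_high = shrink_lfc(observed_lfc, se=0.1, prior_sd=prior_sd)

print(f"low baseMean:  {observed_lfc:.2f} -> {shrunk_low:.2f}")
print(f"high baseMean: {observed_lfc:.2f} -> {shrunk_high:.2f}")
```

Both genes start with the same observed fold change, but only the poorly supported one is moved substantially towards zero.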

Check the DESeq2 vignette for it.

Some examples:

Unshrunken FCs:

[image not preserved]

Shrunken:

[image not preserved]