How to interpret the output from a DESeq2 WALD test analysis in R to compare gene expression in two different tissues.
Entering edit mode
7 weeks ago


I am trying to see which genes are equally (or conversely, differentially) expressed across two tissue types: Artery and Blood. I need to exclude those which are similarly expressed from my study. I wish to use the Wald test data to help me.

These are the results of the DeSEQ analysis:

Question: For each gene how do I interpret the "stat" and/or other values to accomplish this? I understand everything bar the "stat" value. Is there a specific "stat" value range, or direction I should use as a cut-off? I'm really struggling to make sense of this.

log2 fold change (MLE): Tissue_Type Blood vs Artery 
Wald test p-value: Tissue Type Blood vs Artery 
DataFrame with 10147 rows and 6 columns
                      baseMean log2FoldChange     lfcSE      stat       pvalue         padj
                     <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
ENSG00000000003.14    33.81022      -4.099087 0.0547015 -74.93560  0.00000e+00  0.00000e+00
ENSG00000000419.12   115.26431      -2.223858 0.0324612 -68.50819  0.00000e+00  0.00000e+00
ENSG00000000457.13    18.71515       1.179049 0.0329148  35.82125 5.15710e-281 1.08739e-280
ENSG00000000460.16     4.40886       0.235817 0.0449273   5.24886  1.53041e-07  1.68755e-07
ENSG00000000938.12  1684.10848       7.874469 0.0523470 150.42842  0.00000e+00  0.00000e+00
...                        ...            ...       ...       ...          ...          ...
ENSG00000284237.1  3.55911e-01      -3.075739 0.1297507 -23.70499 3.20315e-124 5.16753e-124
ENSG00000284308.1  1.74974e+01      -0.393802 0.0494972  -7.95603  1.77644e-15  2.06777e-15
ENSG00000284413.1  1.46580e+01      -3.658038 0.0699289 -52.31082  0.00000e+00  0.00000e+00
ENSG00000284484.1  5.86535e+04      -1.767739 0.0949798 -18.61173  2.58128e-77  3.73497e-77
ENSG00000284526.1  9.62264e+00       4.137284 0.0958993  43.14196  0.00000e+00  0.00000e+00

Many thanks.

Wald gene-expression-comparison DESeq2 • 734 views
Entering edit mode

Typically, the info about the output of funtions is in the function manual, under the "value" heading. For this object, you can see it here:

For the Wald test, stat is the Wald statistic: the log2FoldChange divided by lfcSE.

This is called the "test statistic". It is like a summary of your data, something that is computed from your data, so it can be used to check against a null distribution to see how much your result deviates from what would be expected under the "null hypothesis" (see wiki). When you do this comparison you obtain a p-value which indicates you the probability of having observed your data if the null hypothesis/assumptions were true (e.g. if there were no differences between groups).

Nonetheless, for the filtering/selection of significant genes one typically filters based on p-value and logFC magnitude.

Entering edit mode

Papyrus thank you very much.

That was a a good answer and it has helped my understanding.

Assuming I have screened out results < 0.05 padj level, what would you think to be a reasonable cut-off range for log2FoldChange value if I wanted similarly expressed genes identified?

Ultimately I will be comparing across 3 different tissue types. Ultimately I plan to compare A==B, A==C, B==C and screen out that way.

Entering edit mode

Well with these tests one usually focuses on identifying differentially expressed genes, I guess that you could say that the rest of genes are similarly expressed (or rather, you don't have evidence of them being differentially expressed). For the selection of differentially expressed, the logFC depends on what you think is a biologically meaningful change, etc (see answers like this one). If conversely you want to focus on genes with no changes across groups (not differentially expressed), maybe you would want more (less?) conservative and remove more differentially expressed genes, or also look at genes with low variance across all groups, etc.

Entering edit mode

Papyrus thank you. This is really more useful than you could know!

So if I have 3 tissues, A B C.

and look at LFC for them pairwise by running

  ...DeSEQ..   contrast=c("X","Z") etc
  • the first tissue in each being the control.

And get LFC values (e.g.) of

B-A : -4

C-A: +2

B-C: +6

that might indicate C is being expressed most of all, then B, then A? I have noticed they add up with actual data.

Is my interpretation correct?

Entering edit mode
7 weeks ago

A gene having a p-value > 0.05 does not mean that it has the same expression in two conditions, just that there is not enough evidence to conclude that it is definitely different. This is compounded by the fact that we tend to need very strong evidence of a gene being differential before we say that is significant. Thus, if you just select genes that have p-value>0.05, you will be getting many genes that are different, as well as ones that arn't.

DESeq does allow you to test against a non-standard alternative hypothesis. The standard alt hypothesis is |LFC| > 0. This makes the null hypothesis LFC==0 (i.e. the inverse of the alt hypothesis). You can tell DESeq to use an alt-hypothesis |LFC|<x, where x is some LFC threshold. This will select for genes where we are confident that the gene has not changed between conditions by more than x. If you set x to 0.58, then you will select for genes that have definitely changed less than 1.5 fold.

What you do will be determined by the biology at hand. If only keep genes where adjusted p-value (padj)<0.05 under the standard alt hypothesis, you will be sure not to include any genes that don't change in your downstream analysis, but at the risk of exclude many genes that do change.

If you only discard genes where padj<0.05 under the |LFC|<0.58 alt-hypothesis, you will be sure to only discard genes you are sure don't change, and keep those where there is any chance they change, but at the risk of keeping some that don't change.

You need to decide if it is better to risk losing some genes that are differential, or keeping some genes that don't change.

Entering edit mode

Ian - thank you very much for that insight which is really very useful indeed. Especially on a Sunday! I am sure it will be useful to many more than just myself.

This is an exploration and descriptive exercise rather than an inferential one. I have a gene set of approx 10k genes, so I'm happy to lose true positives in the quest to eliminate false positives.

My thinking on this is developing as I read these responses.

The ultimate goal is to identify which genes are both equally expressed and up-regulated in A-Tissue and B-Tissue, but not expressed in C-Tissue.

So, I have been exploring LFC2 values arising from the contrast function info to identify which genes they may be. C-Tissue is my control.

If I understand this correctly (please correct me if I don't) :

 TEST [1]

 with Gene1 contrast : A-Tissue - C-Tissue :

 <0 fold change indicates Gene1 in C-Tissue is under expressed, and in A-Tissue is over expressed.
 >0 fold change indicates Gene1 C-Tissue is over expressed, and in A-Tissue is under expressed.

 TEST [2]

 with Gene1 contrast : B-Tissue - C-Tissue :

 <0 fold change indicates Gene1 in C-Tissue is under expressed, and in B-Tissue is over expressed.
 >0 fold change indicates Gene1 in C-Tissue is over expressed, and in B-Tissue is under expressed.

 TEST [3]

 with Gene1 contrast : B-Tissue - A-Tissue :

 <0 fold change indicates A-Tissue is under expressed, and B-Tissue is over expressed.
 >0 fold change indicates A-Tissue is over expressed, and B-Tissue is under expressed.
 However, in test 3, for my purposes, it is irrelevant for reasons below.


 TEST [3] - a negative or positive doesn't matter because I want there to be little difference between A-Tissue and B-Tissue - I need the LFC2 to be between  -1 and +1.

 TEST [1] - need a negative fold change - with  LFC2 being less than -1.5 - showing A expressed, C is not.
 TEST [2] - need a negative fold change - with  LFC2 being less than -1.5 - showing B expressed, C is not.

 ANSWER : This will tell me the Gene 1 is is equally(ish) over expressed in A-Tissue and B-Tissue, but not in C-Tissue. 

BUT - using the opposite (positive TEST 1 and TEST 2) LFC2 values, I could identify whether a gene is is normally expressed in A-Tissue and B-Tissue, but under expressed in C-Tissue. But as we are talking about log values, this would effectively have the same functional meaning as the first answer above??

I was planning on using 0.01 as a padj value, perhaps even 0.001 if I find I till have a lot of genes in my sample after the above filtering.

Do I have this right? I come from a social science stats/minitab background R and DESeq is a steep learning curve for me.


Login before adding your answer.

Traffic: 2668 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6