[DESeq2] time series analysis. Linear and squared model
1
2
Entering edit mode
7.4 years ago
john ▴ 30

Hello guys,

I am doing some RNA-seq analysis with DESeq2. Time course 0-48 h with one treatment of the cells. 0h is not treated and no replicates do exist.

I want to find out whether there is a linear response between time and gene expression (where time is considered to be a continuous covariate).

This is what I did so far:

>ddsfm <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ time)
h0_1 h2_2 h4_3 h6_4 h8_5 h10_6 h12_7 h14_8 h24_9 h48_10
A1BG-AS1   17   12   19   47   28     4    12    36    31    114
A1CF        0    1    2    2    4     0     0     5     3     22
A2M         3    4    6   18   11    20    20    18    17     60
A2M-AS1     0    3    1    6    0    10     4     9     2     12
A2ML1       0    0    3    4    6     2     2     2     3      7
A2MP1       9   20   34  144   97     5    35    72    69    112

> coldata
intercept time
h0_1           1   h0
h2_2           1   h2
h4_3           1   h4
h6_4           1   h6
h8_5           1   h8
h10_6          1  h10
h12_7          1  h12
h14_8          1  h14
h24_9          1  h24
h48_10         1  h48

> dds=DESeq(ddsfm,test="LRT",reduced= ~ 1, fitType='local')

> resultsNames(dds)
[1] "Intercept"      "time_h10_vs_h0" "time_h12_vs_h0" "time_h14_vs_h0"
[5] "time_h2_vs_h0"  "time_h24_vs_h0" "time_h4_vs_h0"  "time_h48_vs_h0"
[9] "time_h6_vs_h0"  "time_h8_vs_h0"


As far as I understood, by calling

> res=results(dds,name="Intercept")


I get the genes that respond to time. Correct?

So I just want to know if this is fine so far. Any thoughts are greatly appreciated.

The results look like that.

> resOrdered <- res[order(res$pvalue),] > head(resOrdered) log2 fold change (MLE): Intercept LRT p-value: '~ time' vs '~ 1' DataFrame with 6 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> LZTR1 46.28472 2.895392 2.466350 15.41700 0.08010024 0.9998715 FLJ37453 128.43150 5.558342 2.031662 15.32917 0.08228201 0.9998715 HMBS 47.71543 2.310480 2.494872 15.25735 0.08410516 0.9998715  Cheers John Well it seems like that spanish guy from here comparing each time point with deseq2 and their results had a similar question. Also since my p-values are not really low it seems like time has no influence on the gene's expression. rna-seq deseq2 • 4.4k views ADD COMMENT 1 Entering edit mode Keep in mind that your model isn't looking for a linear affect of time on gene expression. If that's really what you want (it's what you wrote), then you need time as a numeric covariate rather than a factor. ADD REPLY 0 Entering edit mode Hi Ryan, Thanks for your response. I was basically following the guy here DESEq2 warning while using GLM fitting > colData(dds) DataFrame with 10 rows and 3 columns intercept time sizeFactor <factor> <factor> <numeric> h0_249_1 1 h0 0.4031904 h2_249_2 1 h2 0.5897333 h4_249_3 1 h4 0.9342609 h6_249_4 1 h6 1.9014400 h8_249_5 1 h8 0.9675656 h10_249_6 1 h10 1.0818049 h12_249_7 1 h12 0.6057505 h14_249_8 1 h14 1.2609993 h24_249_9 1 h24 0.9578546 h48_249_10 1 h48 3.7064809  Now I have got: resultsNames(dds) [1] "Intercept" "time" > colData(dds) DataFrame with 10 rows and 3 columns intercept time sizeFactor <numeric> <numeric> <numeric> h0_249_1 1 0 0.4031904 h2_249_2 1 2 0.5897333 h4_249_3 1 4 0.9342609 h6_249_4 1 6 1.9014400 h8_249_5 1 8 0.9675656 h10_249_6 1 10 1.0818049 h12_249_7 1 12 0.6057505 h14_249_8 1 14 1.2609993 h24_249_9 1 24 0.9578546 h48_249_10 1 48 3.7064809  Also my results seem different now: > resOrdered <- res[order(res$padj),]
log2 fold change (MLE): time
LRT p-value: '~ time' vs '~ 1'
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange       lfcSE      stat       pvalue
<numeric>      <numeric>   <numeric> <numeric>    <numeric>
KRT8     1612.214997    -0.05406463 0.008487662  32.71247 1.068497e-08
LOC1720     3.621876     0.13374204 0.029702967  30.17952 3.938465e-08
WDR65      55.149914    -0.05988746 0.011911142  22.82437 1.775008e-06
ANO5        2.897376     0.11646892 0.031397022  19.90285 8.147898e-06
ARRDC4      2.519225     0.11579147 0.030950616  19.37764 1.072556e-05
HLA-DPA1   69.633022     0.18477908 0.038431829  18.44600 1.747874e-05
<numeric>
KRT8     8.469979e-05
LOC1720  1.561010e-04
WDR65    4.690162e-03
ANO5     1.614710e-02
ARRDC4   1.700430e-02
HLA-DPA1 1.979342e-02


We might get something usefull out of that after all. I'll continue tomorrow. Thanks so far. (-:

0
Entering edit mode

Yeah, it's really just a matter of what question you want to ask. Your original model was more flexible, but you really need replicates for it to be useful. With the model above, you're restricted to finding genes that change in a more or less linear fashion with time (e.g., if expression doubles between 0h and 2h, then it should then double again every two hours), but you don't really need much in the way of replicates.

BTW, the fold changes produced are now "per hour".

0
Entering edit mode

Fold changes "per hour" - nice to know.

Ryan Devon: "but you don't really need much in the way of replicates."

I suppose that means that I don't really need replicates here.

One could even assume there is an quadratic (time2) fold change. I did this like that:

> colData(dds)
DataFrame with 10 rows and 4 columns
intercept      time timeSquared sizeFactor
<numeric> <numeric>   <numeric>  <numeric>
h0_249_1           1         0               0          0.4031904
h2_249_2           1         2               4          0.5897333
h4_249_3           1         4              16          0.9342609
h6_249_4           1         6              36          1.9014400
h8_249_5           1         8              64          0.9675656
h10_249_6          1        10             100      1.0818049
h12_249_7          1        12             144      0.6057505
h14_249_8          1        14             196      1.2609993
h24_249_9          1        24             576      0.9578546
h48_249_10         1        48            2304      3.7064809


So this time I would like to know what genes fit either to a linear time model or a squared time model (or to no time).

I defined my model like this:

>ddsfm <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ time+timeSquared)
>dds=DESeq(ddsfm,test="LRT",reduced= ~ 1, fitType='local')

> resultsNames(dds)
[1] "Intercept"        "time.timeSquared"    #<---- This is not what I want!


How is it done properly?

I could just do two separate runs

• one with design = ~ time reduced= ~1
• the other with design = ~ timeSquared reduced= ~1

But I am sure there is a better way.

0
Entering edit mode

So anyway I did two separate runs one with ~ Lineartime' vs '~ 1' and one with ~ Squaredtime' vs '~ 1'

Here one can see sufficient low pvalues but then looking at padj it tells me otherwise.

Can these first few genes considered to be differentially expressed?

>resOrdered <- res[order(res\$padj),]

LRT p-value: '~ time' vs '~ 1'
DataFrame with 23709 rows and 5 columns
<numeric>      <numeric>  <numeric>    <numeric>    <numeric>
KRT8    1612.214997    -0.05406463   32.71247 1.068497e-08   8.469979e-05
LOC1720    3.621876     0.13374204   30.17952 3.938465e-08   1.561010e-04
WDR65     55.149914    -0.05988746   22.82437 1.775008e-06  4.690162e-03
ANO5       2.897376     0.11646892   19.90285    8.147898e-06  1.614710e-02
ARRDC4     2.519225     0.11579147   19.37764 1.072556e-05   1.700430e-02

In the time squared model the padj values seem even worse.

LRT p-value: '~ timeSquared' vs '~ 1'
DataFrame with 23709 rows and 5 columns
<numeric>      <numeric> <numeric>    <numeric>      <numeric>
LOC1720       3.621876    0.002265750  21.06191 4.446813e-06    0.07646739
KRT8       1612.214997   -0.001006767  16.32833 5.326171e-05    0.45794416
ANO5          2.897376    0.002025348  15.26385   9.348877e-05    0.46131135
ARRDC4        2.519225    0.001933524  14.07206 1.759376e-04    0.46131135
CHRNA7        3.186293    0.001550793  13.95561 1.871785e-04    0.46131135


I am not sure how to handle the high padj values in that context...

0
Entering edit mode

Yeah, I'd take the first few genes of the first result set as being significant. For the quadratic time model, apparently there's simply not much that can reasonably fits that model.

0
Entering edit mode
7.4 years ago
john ▴ 30

Ok I'll do that!

What's also puzzling me is the high number of NAs in the padj column.

15.000 out of 23.000 genes have NA.

I had a look at the data and this is either caused by:

1. The gene has 0 counts in all time points.
2. The gene has 0 in some of the time points

Here is a snapshot of genes (counts) with NA in padj column.

"A4GALT" 0 0 0 0 0 3 0 0 0 0
"A4GNT" 0 0 1 0 1 0 0 0 0 10
"AA06" 0 0 0 0 0 0 0 0 0 0
"AACS" 1 0 1 0 0 1 0 2 0 0
"AACSP1" 0 0 0 0 0 0 0 0 0 0
"AADAC" 0 0 1 0 0 0 0 0 0 1
"AADACL2" 0 0 0 0 0 0 0 0 2 0


But I guess this is just "normal".

Edit: I remember reading something about NAs in DESeq but cant find it anymore.