Limma for paired and unpaired samples. Strange number of DEGs in unpaired samples
2
0
Entering edit mode
10 months ago
dzisis1986 ▴ 60

Hello,

I have RNA-seq data from two different treatment groups (F and NF ) at 2 different time points (T1 and T2). The mapping was done with STAR aligner and the quantification was done with FeatureCounts. When I perform the paired analysis in F: T2 vs T1 and NF: T2 vs T1. I create a design matrix including the variables Individ+medianIsize+GC+RIN+Plate from my metadata file which looks like:

       Individ age sex fasting timePoint medianIsize    GC   BMI BMI_category extraction_batch_ID extraction_date plate RIN
001_T2   ID001  54   F       F        T2         442 59.10 32.67            4                 101        31/07/19     9 8.4
002_T2   ID002  65   F       F        T2         434 57.20 31.80            4                 101        31/07/19     9 8.9
003_T2   ID003  64   F       F        T2         423 58.17 29.47            3                 100        31/07/19     8 7.5
004_T2   ID004  65   F       F        T2         400 54.90 28.00            3                 100        31/07/19     8 8.5
005_T2   ID005  56   F       F        T2         381 54.77 21.71            2                  99        30/07/19     8 9.1
006_T2   ID006  53   F       F        T2         450 53.18 32.85            4                  63        22/04/19     5 9.5


My count matrix looks like

                001_T2 002_T2 003_T2 004_T2 005_T2 006_T2 007_T2 008_T2 009_T2 010_T2 011_T2 013_T2 014_T2 015_T2 016_T2 017_T2 018_T2
ENSG00000186827     68    248    250    269    143    268    253    293    144    117    183    145    263    334    180    211    197
ENSG00000186891     61    130    185    111    164    120    139    341    111    156    147     95    253    349    121    281    153


In paired analysis limma-voom returns 3302 DEGS (padjust < 0,05) out of 17231 genes for F: T2 vs T1 and 3844 for NF: T2 vs T1.

When I am trying to perform unpaired analysis for comparisons of treatment groups within time points ie T1 : F vs NF or T2: F vs NF, I create a design matrix with different combinations of the covariates: medianIsize + GC + age + sex + BMI+ RIN. In each combination of covariates limma returns very few DEGs (7-32). From a biological standpoint, we expect a large number of DEGs for these comparisons. Below is the table of different versions of covariates and results of limma :

Input Covariates    T1: F vs NF T2: F vs NF
medianIsize + GC    14096   14041
medianIsize + GC + age + sex    5   9
medianIsize + GC + age + sex + BMI  7   10
GC + age + sex + BMI+ RIN   30  13
GC + age + sex + BMI    7   12
medianIsize + GC + age + sex + BMI+ RIN 32  10
medianIsize + GC + age + sex + BMI+plate    7   9
medianIsize + GC + age + sex + BMI+RIN+plate    13  10
age + sex + BMI+RIN+plate   13  6
medianIsize + GC +sex+ RIN+plate    311 156


Since I was expecting more DEGs in unpaired analysis (T1: F vs NF, T2: F vs NF) do you think that this kind of result makes sense based on the combination of covariates that we use? Is it possible the combination of covariates to create a model matrix that is not proper for the limma algorithm in such a way that restricts the number of DEGs produced?

DifferentialExpression RNAseq R Limma • 746 views
0
Entering edit mode
0
Entering edit mode
10 months ago
Gordon Smyth ★ 5.6k

If your experimental design is properly paired, then naturally you get more powerful results from a paired analysis. It isn't correct to treat a paired design as if it was unpaired, ignoring the correlation between repeat samples from the same individuals.

You do seem to have a lot of variables in your models but, without seeing the complete metadata file or the complete model fitted, it's hard to say more.

0
Entering edit mode
10 months ago
dzisis1986 ▴ 60

We have a total of 801 samples. These belong to two groups (i.e. two conditions F and NF) sampled at two timepoints. My metadata file has the 801 samples and the variables we want to use for our analysis. A sample of metadata file is here :

 sampleName Individ age sex fasting timePoint medianIsize    GC   BMI BMI_category extraction_batch_ID extraction_date plate RIN
1     001_T1   ID001  54   F       F        T1         427 57.74 34.63            4                   1        19/02/19     1 9.2
2     001_T2   ID001  54   F       F        T2         442 59.10 32.67            4                 101        31/07/19     9 8.4
3     002_T1   ID002  65   F       F        T1         411 57.08 31.13            4                   1        19/02/19     1 6.2
4     002_T2   ID002  65   F       F        T2         434 57.20 31.80            4                 101        31/07/19     9 8.9
5     003_T1   ID003  64   F       F        T1         430 58.14 31.24            4                   2        19/02/19     1 7.0
6     003_T2   ID003  64   F       F        T2         423 58.17 29.47            3                 100        31/07/19     8 7.5


There are two kinds of comparisons: 1) The paired analysis in which we check for DEGS across timepoints for each group. 2) The unpaired analysis in which we check for DEGs across groups for each timepoint

We have selected co-variates using medianIsize + GC + age + sex + BMI+ RIN. For the paired analysis across timepoints limma outputs 3302 and 3844 DE genes. For the unpaired analysis within timepoints, we get a very low number of discoveries (10-30), when we are expecting more differences across groups. We have tried removing some of the covariates, but the picture doesn’t change much unless we remove a specific covariate (BMI) which we know affects our findings.

Is there another way to set up our analysis to uncover the differences we expect across groups, within each timepoint?

2
Entering edit mode

Your new post contains no new information that wasn't in your original question and my answer remains the same as before.

Your unpaired analysis is not correct for the reasons I explained before but, even if done correctly, comparing between groups will always give far fewer DE genes than comparing time-points because the former does not adjust for baseline differences between subjects whereas the latter does. That is a basic scientific limitation, not a software issue.

Please see the section in the limma User's Guide on multi-level experiments.

There are many ways to increase power in limma, like using sample weights or generating surrogate variables. But the best thing would be to collaborate with an experienced statistical bioinformatician at your own institution.

0
Entering edit mode

Thank you very much for the detailed reply Gordon. I will check the User's Guide again for multilevel experiments and i will try to get advice from a more expert bioinformaticial than me.