Deseq2: What is the most correct experimental design for this experiment?
1
0
Entering edit mode
5.9 years ago
salamandra ▴ 550

My RNA-seq experiment layout is like this:

d0_CD49fneg_CD34neg_1
d0_CD49fneg_CD34neg_2
d0_CD49fneg_CD34neg_3
d15_CD49fpos_CD34neg_1
d15_CD49fpos_CD34neg_2
d25_CD49fpos_CD34neg_1
d25_CD49fpos_CD34neg_2
d25_CD49fpos_CD34neg_3
d25_CD49fpos_CD34pos_1
d25_CD49fpos_CD34pos_2
d25_CD49fpos_CD34pos_3

I want to account for all variables when determining differentialy expressed genes with Deseq2. So I did this experimental design:

sampleNames <- c("BJ_1", "BJ_2", "BJ_3", "d15_CD49f+_1", "d15_CD49f+_2", "d25_CD49f+_1", "d25_CD49f+_2", "d25_CD49f+_3", "d25_CD49f+CD34+_1", "d25_CD49f+CD34+_2", "d25_CD49f+CD34+_3")
sampleFiles <- c("1A_ATCACG_counts", "2A_CAGATC_counts", "3A_ATCACG_counts", "2B_ACTTGA_counts", "3B_CGATGT_counts", "1C_TTAGGC_counts", "2C_GATCAG_counts", "3C_TTAGGC_counts", "1E_ACAGTG_counts", "2E_GGCTAC_counts", "3F_GCCAAT_counts")
Time <- factor(c(rep('d0',3), rep('d15',2),rep('d25',6)))
CD49 <- factor(c(rep('CD49fneg',3), rep('CD49fpos',8)))
CD34 <- factor(c(rep('CD34neg',8), rep('CD34pos',3)))
Table <- data.frame(sampleName = sampleNames, fileName = sampleFiles, time = Time, CD49 = CD49, CD34 = CD34)
Table
             sampleName         fileName time     CD49    CD34
    1               BJ_1 1A_ATCACG_counts   d0 CD49fneg CD34neg
    2               BJ_2 2A_CAGATC_counts   d0 CD49fneg CD34neg
    3               BJ_3 3A_ATCACG_counts   d0 CD49fneg CD34neg
    4       d15_CD49f+_1 2B_ACTTGA_counts  d15 CD49fpos CD34neg
    5       d15_CD49f+_2 3B_CGATGT_counts  d15 CD49fpos CD34neg
    6       d25_CD49f+_1 1C_TTAGGC_counts  d25 CD49fpos CD34neg
    7       d25_CD49f+_2 2C_GATCAG_counts  d25 CD49fpos CD34neg
    8       d25_CD49f+_3 3C_TTAGGC_counts  d25 CD49fpos CD34neg
    9  d25_CD49f+CD34+_1 1E_ACAGTG_counts  d25 CD49fpos CD34pos
    10 d25_CD49f+CD34+_2 2E_GGCTAC_counts  d25 CD49fpos CD34pos
    11 d25_CD49f+CD34+_3 3F_GCCAAT_counts  d25 CD49fpos CD34pos

But when running this:

dds2 <- DESeqDataSetFromHTSeqCount(sampleTable = Table, design= ~ time + CD49 + CD34)

I got this error:

Error in checkFullRank(modelMatrix) : 
  the model matrix is not full rank, so the model cannot be fit as specified.
  One or more variables or interaction terms in the design formula are linear
  combinations of the others and must be removed.

  Please read the vignette section 'Model matrix not full rank':

  vignette('DESeq2')

From what I understood of vignette section, the error shows because CD49fneg vs CD49fpos is the 'same' as d0 vs d15 or d0 vs d25. So I 'merged' time and CD49f and repeated:

CellType <- factor(paste0(Time,CD49))
Table <- data.frame(sampleName = sampleNames, fileName = sampleFiles, CellType = CellType, CD34 = CD34)
Table
         sampleName         fileName    CellType    CD34
1               BJ_1 1A_ATCACG_counts  d0CD49fneg CD34neg
2               BJ_2 2A_CAGATC_counts  d0CD49fneg CD34neg
3               BJ_3 3A_ATCACG_counts  d0CD49fneg CD34neg
4       d15_CD49f+_1 2B_ACTTGA_counts d15CD49fpos CD34neg
5       d15_CD49f+_2 3B_CGATGT_counts d15CD49fpos CD34neg
6       d25_CD49f+_1 1C_TTAGGC_counts d25CD49fpos CD34neg
7       d25_CD49f+_2 2C_GATCAG_counts d25CD49fpos CD34neg
8       d25_CD49f+_3 3C_TTAGGC_counts d25CD49fpos CD34neg
9  d25_CD49f+CD34+_1 1E_ACAGTG_counts d25CD49fpos CD34pos
10 d25_CD49f+CD34+_2 2E_GGCTAC_counts d25CD49fpos CD34pos
11 d25_CD49f+CD34+_3 3F_GCCAAT_counts d25CD49fpos CD34pos

dds2 <- DESeqDataSetFromHTSeqCount(sampleTable = Table, design= ~ CD34 + CellType)

What I want to know is:

1 - Is this experimental design correct?

2 - If I want fold-change for e.g. CD49fneg vs CD49fpos, including all samples of CD49fpos (those CD34neg and those CD34pos) how do I do?

Deseq2 experimental design RNA-seq • 2.7k views
ADD COMMENT
1
Entering edit mode
5.9 years ago
  1. Get rid of CD49 and keep time rather than the other way around. This will produce the exact same results as ~CD34 + CellType but might prove a bit clearer when you think about it.
  2. Use an LRT and drop time from the reduced model.
ADD COMMENT
0
Entering edit mode

1 - You mean like this?:

CellType2 <- factor(paste0(CD49,CD34))

Table3 <- data.frame(sampleName = sampleNames, fileName = sampleFiles, CellType2 = CellType2, Time=Time)
Table3

dds3 <- DESeqDataSetFromHTSeqCount(sampleTable = Table3, design= ~ Time + CellType2)

It gives the same error...

2 - are you saying to do like this?

dds3 <- DESeqDataSetFromHTSeqCount(sampleTable = Table3, design= ~ Time + CellType2)

dds3 <- dds3[rowSums(counts(dds3)) > 1, ]

ddsHTSeq3 <- DESeq(dds3, test='LRT', reduced = ~Time)

res <- results(dds3)
ADD REPLY
1
Entering edit mode
  1. No, just remove CD49 from your design entirely. Every time other than time 0 is CD49 positive, so it's redundant (this causes the error you saw).
  2. Yes, though with a full design of ~CD34 + time.
ADD REPLY
0
Entering edit mode

Thank you for the answer, but now I'm a bit confused. The 'reduced' variable isn't the one we want to remove? If we want to know differences between CD49+ and CD49- (or in other words d0 vs d15+d25) shouldn't we remove the other variable CD34 (reduced=CD34)?

ADD REPLY
0
Entering edit mode

No, the reduced design is what's left after removing what you're interested in. What an LRT does is compare two model, the difference between them then being what you're statistically testing.

ADD REPLY
0
Entering edit mode

Exackly, what I'm interested in is time ( or cd49), so the reduced design (what's left) shouldn't be CD34? Or the reduced option defines what you remove, instead of what's left? I'm sorry if I'm missing the obvious

ADD REPLY
1
Entering edit mode

I must have misunderstood at some point, I though you were interested in CD40 status. Yes, if you're interested in testing CD34 status then you can remove it from the reduced model (or make your life easier and use results(dds, coef=CD34_neg_vs_pos)).

ADD REPLY
0
Entering edit mode

I'm having problems in understanding confounding variables, because imagine this situation:

time <- factor(LETTERS[1:4])
celltype <- factor(c(1,2,2,3))
sampleNames <- c("hey","hi","HRY", "GT")
sampleFiles <- c("1A_ATCACG_counts", "2A_CAGATC_counts", "3A_ATCACG_counts", "2B_ACTTGA_counts")
sampleIds <- factor(sampleNames)
table <- data.frame(sampleName = sampleNames, fileName = sampleFiles, time = time, celltype = celltype ,sampleIds = sampleIds)
table
sampleName         fileName time celltype sampleIds
1        hey 1A_ATCACG_counts    A        1       hey
2         hi 2A_CAGATC_counts    B        2        hi
3        HRY 3A_ATCACG_counts    C        2       HRY
4         GT 2B_ACTTGA_counts    D        3        GT

I wanted to get the genes that vary across celltype (using a reduced model) and then split those genes that increase and decrease from those that increase and maintain expr. over celltype (using degPatterns). But when I do:

dds <- DESeqDataSetFromHTSeqCount(sampleTable = table, design= ~ time + celltype)

get the confounding variables error:

Error in checkFullRank(modelMatrix) : 
  the model matrix is not full rank, so the model cannot be fit as specified.
  One or more variables or interaction terms in the design formula are linear
  combinations of the others and must be removed.

  Please read the vignette section 'Model matrix not full rank':

  vignette('DESeq2')

what should I do then to get genes varying across celltype? cause celltype '2' occurs at two different time points (B and C), so can't combine both variables into one.

ADD REPLY
0
Entering edit mode

There's nothing you can do, the study was not set up in a way to allow you to test what you want to test.

ADD REPLY
0
Entering edit mode

And simply removing time of the table/design wouldn't be correct? I ask because most of variance in PCA plot (PC1 and PC2) is due to cell type. Not even in PC2 and PC3 time seems to vary.

ADD REPLY
0
Entering edit mode

If time is really not affecting anything then yes you can remove it, it's just that in the design you showed it's mathematically impossible to tell any difference between it and the cell types. What does time actually represent here?

ADD REPLY
0
Entering edit mode

The experience consists in converting one cell type in the other. So, at different time points of the process we did rna_seq. Celltype is the phenotype we observed. So for one of the time points we observed two phenotypes and variance was big and for two other time points which shared same phenotype variance was small. We know which cell type are more mature, so although for one time point we have two phenotype we know which one is more immature (occurs first)

ADD REPLY
1
Entering edit mode

OK, in that case you don't actually need to model time. I'm assuming you did your library preparation/RNA extraction at a single time point rather than as you made the cell lines, since in my experience that can create a significant batch effect.

ADD REPLY
0
Entering edit mode

thank you :) I'm not sure, as I haven't done it. I'll just play safe and work with each combination of time and celltype, intead of time alone or cell type alone

ADD REPLY

Login before adding your answer.

Traffic: 1800 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6