Entering edit mode

5.9 years ago

salamandra
▴
550

My RNA-seq experiment layout is like this:

```
d0_CD49fneg_CD34neg_1
d0_CD49fneg_CD34neg_2
d0_CD49fneg_CD34neg_3
d15_CD49fpos_CD34neg_1
d15_CD49fpos_CD34neg_2
d25_CD49fpos_CD34neg_1
d25_CD49fpos_CD34neg_2
d25_CD49fpos_CD34neg_3
d25_CD49fpos_CD34pos_1
d25_CD49fpos_CD34pos_2
d25_CD49fpos_CD34pos_3
```

I want to account for all variables when determining differentialy expressed genes with Deseq2. So I did this experimental design:

```
sampleNames <- c("BJ_1", "BJ_2", "BJ_3", "d15_CD49f+_1", "d15_CD49f+_2", "d25_CD49f+_1", "d25_CD49f+_2", "d25_CD49f+_3", "d25_CD49f+CD34+_1", "d25_CD49f+CD34+_2", "d25_CD49f+CD34+_3")
sampleFiles <- c("1A_ATCACG_counts", "2A_CAGATC_counts", "3A_ATCACG_counts", "2B_ACTTGA_counts", "3B_CGATGT_counts", "1C_TTAGGC_counts", "2C_GATCAG_counts", "3C_TTAGGC_counts", "1E_ACAGTG_counts", "2E_GGCTAC_counts", "3F_GCCAAT_counts")
Time <- factor(c(rep('d0',3), rep('d15',2),rep('d25',6)))
CD49 <- factor(c(rep('CD49fneg',3), rep('CD49fpos',8)))
CD34 <- factor(c(rep('CD34neg',8), rep('CD34pos',3)))
Table <- data.frame(sampleName = sampleNames, fileName = sampleFiles, time = Time, CD49 = CD49, CD34 = CD34)
Table
sampleName fileName time CD49 CD34
1 BJ_1 1A_ATCACG_counts d0 CD49fneg CD34neg
2 BJ_2 2A_CAGATC_counts d0 CD49fneg CD34neg
3 BJ_3 3A_ATCACG_counts d0 CD49fneg CD34neg
4 d15_CD49f+_1 2B_ACTTGA_counts d15 CD49fpos CD34neg
5 d15_CD49f+_2 3B_CGATGT_counts d15 CD49fpos CD34neg
6 d25_CD49f+_1 1C_TTAGGC_counts d25 CD49fpos CD34neg
7 d25_CD49f+_2 2C_GATCAG_counts d25 CD49fpos CD34neg
8 d25_CD49f+_3 3C_TTAGGC_counts d25 CD49fpos CD34neg
9 d25_CD49f+CD34+_1 1E_ACAGTG_counts d25 CD49fpos CD34pos
10 d25_CD49f+CD34+_2 2E_GGCTAC_counts d25 CD49fpos CD34pos
11 d25_CD49f+CD34+_3 3F_GCCAAT_counts d25 CD49fpos CD34pos
```

But when running this:

```
dds2 <- DESeqDataSetFromHTSeqCount(sampleTable = Table, design= ~ time + CD49 + CD34)
```

I got this error:

```
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.
Please read the vignette section 'Model matrix not full rank':
vignette('DESeq2')
```

From what I understood of vignette section, the error shows because CD49fneg vs CD49fpos is the 'same' as d0 vs d15 or d0 vs d25. So I 'merged' time and CD49f and repeated:

```
CellType <- factor(paste0(Time,CD49))
Table <- data.frame(sampleName = sampleNames, fileName = sampleFiles, CellType = CellType, CD34 = CD34)
Table
sampleName fileName CellType CD34
1 BJ_1 1A_ATCACG_counts d0CD49fneg CD34neg
2 BJ_2 2A_CAGATC_counts d0CD49fneg CD34neg
3 BJ_3 3A_ATCACG_counts d0CD49fneg CD34neg
4 d15_CD49f+_1 2B_ACTTGA_counts d15CD49fpos CD34neg
5 d15_CD49f+_2 3B_CGATGT_counts d15CD49fpos CD34neg
6 d25_CD49f+_1 1C_TTAGGC_counts d25CD49fpos CD34neg
7 d25_CD49f+_2 2C_GATCAG_counts d25CD49fpos CD34neg
8 d25_CD49f+_3 3C_TTAGGC_counts d25CD49fpos CD34neg
9 d25_CD49f+CD34+_1 1E_ACAGTG_counts d25CD49fpos CD34pos
10 d25_CD49f+CD34+_2 2E_GGCTAC_counts d25CD49fpos CD34pos
11 d25_CD49f+CD34+_3 3F_GCCAAT_counts d25CD49fpos CD34pos
dds2 <- DESeqDataSetFromHTSeqCount(sampleTable = Table, design= ~ CD34 + CellType)
```

What I want to know is:

1 - Is this experimental design correct?

2 - If I want fold-change for e.g. CD49fneg vs CD49fpos, including all samples of CD49fpos (those CD34neg and those CD34pos) how do I do?

1 - You mean like this?:

It gives the same error...

2 - are you saying to do like this?

`CD49`

from your design entirely. Every time other than time 0 is CD49 positive, so it's redundant (this causes the error you saw).`~CD34 + time`

.Thank you for the answer, but now I'm a bit confused. The 'reduced' variable isn't the one we want to remove? If we want to know differences between CD49+ and CD49- (or in other words d0 vs d15+d25) shouldn't we remove the other variable CD34 (reduced=CD34)?

No, the reduced design is what's left after removing what you're interested in. What an LRT does is compare two model, the difference between them then being what you're statistically testing.

Exackly, what I'm interested in is time ( or cd49), so the reduced design (what's left) shouldn't be CD34? Or the reduced option defines what you remove, instead of what's left? I'm sorry if I'm missing the obvious

I must have misunderstood at some point, I though you were interested in CD40 status. Yes, if you're interested in testing CD34 status then you can remove it from the reduced model (or make your life easier and use

`results(dds, coef=CD34_neg_vs_pos)`

).I'm having problems in understanding confounding variables, because imagine this situation:

I wanted to get the genes that vary across celltype (using a reduced model) and then split those genes that increase and decrease from those that increase and maintain expr. over celltype (using degPatterns). But when I do:

get the confounding variables error:

what should I do then to get genes varying across celltype? cause celltype '2' occurs at two different time points (B and C), so can't combine both variables into one.

There's nothing you can do, the study was not set up in a way to allow you to test what you want to test.

And simply removing time of the table/design wouldn't be correct? I ask because most of variance in PCA plot (PC1 and PC2) is due to cell type. Not even in PC2 and PC3 time seems to vary.

If

`time`

is really not affecting anything then yes you can remove it, it's just that in the design you showed it's mathematically impossible to tell any difference between it and the cell types. What does`time`

actually represent here?The experience consists in converting one cell type in the other. So, at different time points of the process we did rna_seq. Celltype is the phenotype we observed. So for one of the time points we observed two phenotypes and variance was big and for two other time points which shared same phenotype variance was small. We know which cell type are more mature, so although for one time point we have two phenotype we know which one is more immature (occurs first)

OK, in that case you don't actually need to model

`time`

. I'm assuming you did your library preparation/RNA extraction at a single time point rather than as you made the cell lines, since in my experience that can create a significant batch effect.thank you :) I'm not sure, as I haven't done it. I'll just play safe and work with each combination of time and celltype, intead of time alone or cell type alone