Question

DESeq2 design matrix for multiple factors

0

4.0 years ago

Ankit ▴ 500

Hi everyone,

I have a question about the design matrix in DESeq2. I have RNA-seq samples from multiple labs, so lab is effectively a batch effect.

Batch: 5 labs, 29 samples. Each has at least two replicates.
RunType: some are single-end, some are paired-end.
Condition: some are wild type and some are knockout.

How should I build the design matrix in DESeq2?

I thought of some possible combinations:

design <- model.matrix(~ Batch + Condition + RunType)

design <- model.matrix(~ Batch + Condition:Batch + RunType:Batch)

design <- model.matrix(~ Condition + Batch + RunType)

design <- model.matrix(~ Condition + Batch:Condition + RunType:Condition)

Which one is correct for removing any batch effect present?

Or are there other possible combinations that I am missing? I am not sure how to model the design for three such factors.

I am also not sure whether I need to include interactions between the three factors.

Please help.

Thanks

RNA-Seq deseq2 design matrix • 2.1k views

ADD COMMENT • link updated 4.0 years ago by ATpoint 82k • written 4.0 years ago by Ankit ▴ 500

2

Impossible for anyone to know, really. Each experiment is unique and much 'back and forth' in the analysis is required.

I would start with:

~ Condition + Batch + RunType

Then, check PCA bi-plots for each variable in the design. If, for example, RunType has no apparent effect, then remove it from the model.

I don't see any need for an interaction term, in this case.
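
Before fitting a design like this, it may be worth checking that its model matrix is full rank, since DESeq2 refuses designs whose columns are linearly dependent. The sketch below uses base R only. The sample sheet is hypothetical (the thread does not say which RunType each lab used), and `is_full_rank` is a made-up helper name. It illustrates the case where RunType is constant within each batch, which makes RunType redundant with Batch.

```r
# Hypothetical sample sheet: 3 batches, RunType constant within each batch.
# This layout is an assumption for illustration, not the poster's real data.
coldata <- data.frame(
  Condition = factor(rep(c("wt", "ko"), times = 6), levels = c("wt", "ko")),
  Batch     = factor(rep(c("b1", "b2", "b3"), each = 4)),
  RunType   = factor(rep(c("single", "paired", "single"), each = 4))
)

# A design is full rank when the rank of its model matrix equals its column count.
is_full_rank <- function(formula, data) {
  mm <- model.matrix(formula, data)
  qr(mm)$rank == ncol(mm)
}

full    <- is_full_rank(~ Condition + Batch + RunType, coldata)
reduced <- is_full_rank(~ Condition + Batch, coldata)

print(full)     # FALSE: RunType is a linear combination of the Batch columns
print(reduced)  # TRUE: dropping RunType gives an estimable design
```

If the real sample sheet has this structure, RunType cannot be estimated separately from Batch, and leaving it out of the design loses nothing.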

ADD REPLY • link 4.0 years ago by Kevin Blighe 87k

0

I think you can merge the Batch and RunType information together; that would be your batch effect. Then use ~ Batch + Condition.
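
One way to do the merge is to paste the two columns into a single factor. This is a minimal base R sketch on a hypothetical sample sheet in which batch b2 contains both run types; the column name `BatchRun` and all values are made up for illustration.

```r
# Hypothetical sample sheet: batch b2 was sequenced with both run types.
coldata <- data.frame(
  Condition = factor(rep(c("wt", "ko"), times = 6), levels = c("wt", "ko")),
  Batch     = rep(c("b1", "b2", "b3"), each = 4),
  RunType   = c(rep("single", 4),
                "paired", "paired", "single", "single",
                rep("single", 4))
)

# One level per Batch/RunType combination that actually occurs in the data.
coldata$BatchRun <- factor(paste(coldata$Batch, coldata$RunType, sep = "_"))
print(levels(coldata$BatchRun))

# The merged factor takes the place of Batch and RunType in the design.
mm <- model.matrix(~ BatchRun + Condition, coldata)
print(qr(mm)$rank == ncol(mm))  # TRUE: the merged design is full rank
```

Only combinations that occur get a level, so the merged factor avoids the redundancy that separate Batch and RunType terms can introduce.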

ADD REPLY • link 4.0 years ago by piyushjo ▴ 700

0

Do you have replicates of each condition in each of the batches? If not, then I doubt you can (or should) try to compare these. At least treat everything as single-end, which eliminates that confounder. Can you post a table that indicates which condition each sample belongs to and which lab it comes from?

ADD REPLY • link 4.0 years ago by ATpoint 82k

0

No, not every batch has replicates of each condition. Is there any way to work around this limitation? Here is the table:

sample_ids  batch_ids  Cell_type  Lab   Condition  Replicates
oEwt1       1          E          Lab1  wt         1
oEwt2       1          E          Lab1  wt         2
oEko1       1          E          Lab1  ko         1
oEko2       1          E          Lab1  ko         2
ID_s32      2          E          Lab2  wt         1
ID_s33      2          E          Lab2  wt         2
ID_s34      2          N          Lab2  wt         1
ID_s35      2          N          Lab2  wt         2
ID_s38      2          E          Lab2  wt         1
ID_s39      2          E          Lab2  wt         2
ID_s40      2          N          Lab2  wt         1
ID_s41      2          N          Lab2  wt         2
ID_s44      2          N          Lab2  wt         1
ID_s45      2          N          Lab2  wt         2
ID_s46      2          N          Lab2  wt         3
ID_s50      2          N          Lab2  wt         1
ID_s51      2          N          Lab2  wt         2
ID_s52      2          N          Lab2  wt         3
ID_s53      2          N          Lab2  wt         1
ID_s54      2          N          Lab2  wt         2
ID_s55      2          N          Lab2  wt         3
oNwt1       3          N          Lab3  wt         1
oNwt2       3          N          Lab3  wt         2
oNko1       3          N          Lab3  ko         1
oNko2       3          N          Lab3  ko         2
ID_k89      4          N          Lab4  wt         1
ID_k90      4          N          Lab4  wt         2
ID_m24      5          E          Lab5  wt         1
ID_m25      5          E          Lab5  wt         2
ID_m27      5          E          Lab5  ko         1
ID_m28      5          E          Lab5  ko         2
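
As a quick check of the table above, a cross-tabulation of Lab against Condition shows which labs contain both wt and ko samples. This is a minimal base R sketch; the two vectors are transcribed by hand from the table, and the variable names are made up for illustration.

```r
# Lab and Condition columns transcribed from the table above (31 rows).
lab <- rep(c("Lab1", "Lab2", "Lab3", "Lab4", "Lab5"), times = c(4, 17, 4, 2, 4))
condition <- c(
  "wt", "wt", "ko", "ko",   # Lab1
  rep("wt", 17),            # Lab2
  "wt", "wt", "ko", "ko",   # Lab3
  "wt", "wt",               # Lab4
  "wt", "wt", "ko", "ko"    # Lab5
)

# Number of samples per lab and condition.
counts_by_lab <- table(lab, condition)
print(counts_by_lab)

# Labs in which both conditions are represented.
labs_with_both <- rownames(counts_by_lab)[rowSums(counts_by_lab > 0) == 2]
print(labs_with_both)  # "Lab1" "Lab3" "Lab5"
```

Lab2 and Lab4 contain only wt samples, so they carry no within-lab information about the wt versus ko difference.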

ADD REPLY • link 4.0 years ago by Ankit ▴ 500

0

The other possibility I thought of is to look at a set of about 50-60 genes that are markers of cell types E and N. I want to show the expression of only these genes as a heatmap across my collected datasets. I would normalize the normalized counts by the expression of a loading control, such as beta-actin, and then compare the expression of these genes across the samples. This way I can at least see how my ko samples differ from wt for these specific marker genes, which may reveal whether the ko has impaired expression of marker genes. Does this make sense?

Please help

ADD REPLY • link 4.0 years ago by Ankit ▴ 500

score 2 · Answer 1 · 2020-05-06

RNA-seq is strongly confounded by library preparation, which here corresponds to lab. I suggest you use only Lab1, Lab3 and Lab5 (the labs that contain both wt and ko samples), run a separate pairwise comparison within each, and then compare the results with a meta-analysis. The design would simply be ~ Condition in each case.

Even though the single-cell field now claims to have methods that can integrate datasets from different platforms, I doubt this is robust for standard RNA-seq with low numbers of replicates. You can check this yourself by putting all of the data above into one analysis, normalizing them together and running a PCA. I anticipate the samples will cluster strongly by lab rather than by condition. Unfortunately, you cannot simply collect unrelated experiments and expect proper results from combining them: they are confounded.
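
A rough sketch of the per-lab approach follows. It assumes `counts` is a genes-by-samples integer matrix and `coldata` is a data frame with `Lab` and `Condition` columns whose rows are in the same order as the columns of `counts`. Both objects, and the helper names `run_one_lab` and `fisher_combine`, are hypothetical. The answer does not name a meta-analysis method; Fisher's method for combining p-values is used here only as one simple option. It ignores the direction of the effect, so the per-lab log2 fold changes should be compared as well.

```r
# Per-lab DESeq2 run with design ~ Condition (requires the DESeq2 package).
run_one_lab <- function(counts, coldata, lab) {
  keep <- coldata$Lab == lab
  sub <- droplevels(coldata[keep, , drop = FALSE])
  sub$Condition <- factor(sub$Condition, levels = c("wt", "ko"))  # wt as reference
  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = counts[, keep, drop = FALSE],
    colData   = sub,
    design    = ~ Condition
  )
  dds <- DESeq2::DESeq(dds)
  DESeq2::results(dds, contrast = c("Condition", "ko", "wt"))
}

# Fisher's method: -2 * sum(log(p)) follows a chi-squared distribution with
# 2k degrees of freedom under the null, for k independent p-values.
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

# Made-up per-lab p-values for a single gene, for illustration only.
p_by_lab <- c(Lab1 = 0.04, Lab3 = 0.10, Lab5 = 0.03)
combined_p <- fisher_combine(p_by_lab)
print(combined_p)
```

In practice `run_one_lab` would be called once each for Lab1, Lab3 and Lab5, and `fisher_combine` applied gene by gene to the three resulting p-values.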