Hi all,

I have the following situation and I just want to make sure that I understand everything correctly from a statistics point of view...

I am running a pseudobulk differential expression analysis with a treatment group and a control group. Each group has two replicates (i.e. Ctrl_1, Ctrl_2, Treat_1 and Treat_2). The replicates were processed in batches, i.e. replicate 1 in batch 1 and replicate 2 in batch 2. After summing the counts for one cell population of interest, we end up with metadata that essentially looks like this:

sample_id | group_id | batch |
---|---|---|
Ctrl_1 | Ctrl | 1 |
Ctrl_2 | Ctrl | 2 |
Treat_1 | Treat | 1 |
Treat_2 | Treat | 2 |

I am interested in comparing Treat vs Ctrl while adjusting for batch, so our model matrix looks like this: `mm <- model.matrix(~ batch + group_id, data = mdata)`

(Intercept) | batch2 | group_idTreat |
---|---|---|
1 | 0 | 0 |
1 | 1 | 0 |
1 | 0 | 1 |
1 | 1 | 1 |
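
For reference, a minimal sketch that reproduces this matrix in base R. The way `mdata` is built here is only an illustration: storing `batch` and `group_id` as factors, with `Ctrl` as the reference level, is an assumption.

```r
# Illustrative metadata; column names follow the table above.
# Storing batch and group_id as factors is an assumption.
mdata <- data.frame(
  sample_id = c("Ctrl_1", "Ctrl_2", "Treat_1", "Treat_2"),
  group_id  = factor(c("Ctrl", "Ctrl", "Treat", "Treat"),
                     levels = c("Ctrl", "Treat")),
  batch     = factor(c(1, 2, 1, 2))
)

# Additive design: batch as a blocking factor, group_id as the effect of interest
mm <- model.matrix(~ batch + group_id, data = mdata)
print(mm)

# Rank of the design versus number of samples:
# 3 independent columns and 4 samples leave 1 residual degree of freedom
design_rank <- qr(mm)$rank
residual_df <- nrow(mm) - design_rank
print(c(rank = design_rank, residual_df = residual_df))
```

With only four samples, a single residual degree of freedom is left for estimating dispersion.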

This is all very straightforward.

Here is the part that confuses me slightly. We are using a method that classifies some cells from the Treat group as controls (because the experimental perturbation did not work properly). This means that we end up with new group_ids, namely Ctrl_like and Treat_like. I am still interested in comparing the expression of Treat_like vs Ctrl_like. Is my assumption correct that a standard pseudobulk differential expression analysis is now impossible, because one sample (e.g. Treat_1) can belong to two groups (Ctrl_like and Treat_like) simultaneously, so that it is no longer possible to adjust for batch effects? This is what the metadata would look like:

sample_id | group_id | batch |
---|---|---|
Ctrl_1 | Ctrl_like | 1 |
Ctrl_1 | Treat_like | 1 |
Ctrl_2 | Ctrl_like | 2 |
Ctrl_2 | Treat_like | 2 |
Treat_1 | Treat_like | 1 |
Treat_1 | Ctrl_like | 1 |
Treat_2 | Treat_like | 2 |
Treat_2 | Ctrl_like | 2 |
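
To make the structure concrete, here is a rough sketch of this metadata in base R, assuming counts are aggregated once per `sample_id` and `group_id` combination (the object names `mdata2` and `mm2` are made up for illustration). It only checks that the original `~ batch + group_id` design is still estimable in the linear-algebra sense. It says nothing about whether treating two rows from the same sample as independent is appropriate.

```r
# Illustrative metadata: one pseudobulk profile per sample_id x group_id pair.
# Row order follows the table above; factor encoding is an assumption.
mdata2 <- data.frame(
  sample_id = factor(rep(c("Ctrl_1", "Ctrl_2", "Treat_1", "Treat_2"), each = 2)),
  group_id  = factor(c("Ctrl_like", "Treat_like", "Ctrl_like", "Treat_like",
                       "Treat_like", "Ctrl_like", "Treat_like", "Ctrl_like")),
  batch     = factor(rep(c(1, 2, 1, 2), each = 2))
)

# Every sample contributes exactly one profile to each group
sample_by_group <- table(mdata2$sample_id, mdata2$group_id)
print(sample_by_group)

# The batch-adjusted design from before, now on 8 pseudobulk profiles
mm2 <- model.matrix(~ batch + group_id, data = mdata2)
is_full_rank <- qr(mm2)$rank == ncol(mm2)
print(is_full_rank)
```

The design has full column rank, but it does not encode that pairs of rows share a sample of origin.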

Any insights on that matter are greatly appreciated!

Will this account for the fact that some cells come from the same original sample? This seems like relevant information for a correct analysis.

Also, I just tried a formula like this and got the following error: `Design matrix not of full rank`.

I assume that is because the design matrix has linearly dependent columns, i.e. the sample_id column already encodes the batch column. Is that correct?
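
A small sketch of how this could be checked in base R. The exact formula is not shown above, so `~ batch + sample_id + group_id` is an assumption, as are the object names. The idea is to compare the rank of the design matrix with its number of columns and to test the suspected dependency directly.

```r
# Illustrative metadata: one pseudobulk profile per sample_id x group_id pair
mdata2 <- data.frame(
  sample_id = factor(rep(c("Ctrl_1", "Ctrl_2", "Treat_1", "Treat_2"), each = 2)),
  group_id  = factor(c("Ctrl_like", "Treat_like", "Ctrl_like", "Treat_like",
                       "Treat_like", "Ctrl_like", "Treat_like", "Ctrl_like")),
  batch     = factor(rep(c(1, 2, 1, 2), each = 2))
)

# Assumed formula: batch and sample_id together (not confirmed by the post)
mm_bad <- model.matrix(~ batch + sample_id + group_id, data = mdata2)
print(c(rank = qr(mm_bad)$rank, columns = ncol(mm_bad)))  # rank is below ncol

# The batch 2 indicator is the sum of the indicators of the batch 2 samples,
# so batch is fully determined by sample_id
batch_is_redundant <- all(
  mm_bad[, "batch2"] ==
    mm_bad[, "sample_idCtrl_2"] + mm_bad[, "sample_idTreat_2"]
)
print(batch_is_redundant)

# Dropping batch and keeping sample_id as the blocking factor restores full rank
mm_ok <- model.matrix(~ sample_id + group_id, data = mdata2)
print(qr(mm_ok)$rank == ncol(mm_ok))
```

Under these assumptions, blocking on `sample_id` already absorbs the batch effect, because every sample belongs to exactly one batch.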