pseudobulk differential expression design matrix
3 months ago
nhaus ▴ 360

Hi all,

I have the following situation and I just want to make sure that I understand everything correctly from a statistics point of view...

I am running a pseudobulk differential expression analysis with a treatment group and a control group. Each group has two replicates (i.e. Ctrl_1, Ctrl_2, Treat_1 and Treat_2). The replicates were processed in batches: replicate 1 in batch 1 and replicate 2 in batch 2. After summing the counts for one cell population of interest, we end up with metadata that essentially looks like this:

sample_id group_id batch
Ctrl_1 Ctrl 1
Ctrl_2 Ctrl 2
Treat_1 Treat 1
Treat_2 Treat 2

I am interested in comparing Treat vs Ctrl while adjusting for batch, so our model matrix looks like this: mm <- model.matrix(~ batch + group_id, data = mdata)

(Intercept) batch2 group_idTreat
1 0 0
1 1 0
1 0 1
1 1 1
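
For reference, the matrix above can be reproduced in base R. This is only a minimal sketch: the data frame is typed in by hand from the metadata table, and the names `rank_mm` and `resid_df` are illustrative, not part of any package.

```r
# Metadata as described above: one pseudobulk sample per row.
mdata <- data.frame(
  sample_id = c("Ctrl_1", "Ctrl_2", "Treat_1", "Treat_2"),
  group_id  = factor(c("Ctrl", "Ctrl", "Treat", "Treat"),
                     levels = c("Ctrl", "Treat")),
  batch     = factor(c(1, 2, 1, 2))
)

# Additive design: batch effect plus group effect.
mm <- model.matrix(~ batch + group_id, data = mdata)
print(mm)

# The design is estimable when the rank equals the number of columns.
rank_mm <- qr(mm)$rank

# Residual degrees of freedom: rows minus estimated coefficients.
resid_df <- nrow(mm) - rank_mm
cat("columns:", ncol(mm), "rank:", rank_mm, "residual df:", resid_df, "\n")
```

With four samples and three coefficients the design is full rank, but only one residual degree of freedom remains for estimating variability.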

This is all very straightforward.

Here is the part that confuses me slightly. We are using a method that classifies some cells from the Treat group as controls (because the experimental perturbation did not work properly). This means that we end up with new group_ids, namely Ctrl_like and Treat_like. I am still interested in comparing the expression of Treat_like vs Ctrl_like. Is my assumption correct that a standard pseudobulk differential expression analysis is now impossible, because one sample (e.g. Treat_1) can belong to two groups (Ctrl_like and Treat_like) simultaneously, and that it is therefore no longer possible to adjust for batch effects? This is what the metadata would look like:

sample_id group_id batch
Ctrl_1 Ctrl_like 1
Ctrl_1 Treat_like 1
Ctrl_2 Ctrl_like 2
Ctrl_2 Treat_like 2
Treat_1 Treat_like 1
Treat_1 Ctrl_like 1
Treat_2 Treat_like 2
Treat_2 Ctrl_like 2
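
To make the structure concrete, here is a minimal base R sketch of this second metadata table. The names `mdata2`, `tab` and `mm2` are illustrative. It only checks how samples and groups are crossed and whether the original `~ batch + group_id` design is still full rank; it does not address whether two pseudobulk profiles from the same sample can be treated as independent.

```r
# One row per (sample, reclassified group) pseudobulk profile,
# in the same order as the table above.
mdata2 <- data.frame(
  sample_id = factor(rep(c("Ctrl_1", "Ctrl_2", "Treat_1", "Treat_2"), each = 2)),
  group_id  = factor(c("Ctrl_like", "Treat_like", "Ctrl_like", "Treat_like",
                       "Treat_like", "Ctrl_like", "Treat_like", "Ctrl_like")),
  batch     = factor(rep(c(1, 2, 1, 2), each = 2))
)

# Cross-tabulation: every sample contributes exactly one profile per group.
tab <- table(mdata2$sample_id, mdata2$group_id)
print(tab)

# The batch-adjusted design from before, now with eight rows.
mm2 <- model.matrix(~ batch + group_id, data = mdata2)
rank_mm2 <- qr(mm2)$rank
cat("columns:", ncol(mm2), "rank:", rank_mm2, "\n")
```

Purely in terms of linear algebra, batch and group are not confounded here, because both groups appear in both batches.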

Any insights on that matter are greatly appreciated!

pseudobulk single-cell scRNA-seq • 361 views
3 months ago

You should be fine because every sample_id and batch appears in every group_id. You would use the formula ~ group_id + sample_id + batch.


Will this account for the fact that some cells come from the same original sample? This seems like relevant information for a correct analysis.


Also, I just tried this formula and got the following error: Design matrix not of full rank.

I assume that is because the design matrix has columns that are linearly dependent, i.e. the sample_id columns already encode the batch column. Is that correct?
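
The linear dependence can be checked directly in base R. This is a sketch under the assumption that the metadata is exactly the eight-row table from the question; `mdata2`, `mm_full` and `mm_nested` are illustrative names. It compares the number of columns with the matrix rank for the suggested formula, and for the same formula without `batch`.

```r
# Eight-row metadata from the question (sample x reclassified group).
mdata2 <- data.frame(
  sample_id = factor(rep(c("Ctrl_1", "Ctrl_2", "Treat_1", "Treat_2"), each = 2)),
  group_id  = factor(c("Ctrl_like", "Treat_like", "Ctrl_like", "Treat_like",
                       "Treat_like", "Ctrl_like", "Treat_like", "Ctrl_like")),
  batch     = factor(rep(c(1, 2, 1, 2), each = 2))
)

# Suggested design: group + sample + batch.
mm_full <- model.matrix(~ group_id + sample_id + batch, data = mdata2)
cat("full design: columns =", ncol(mm_full),
    "rank =", qr(mm_full)$rank, "\n")

# batch2 is the sum of the indicator columns of the two batch-2 samples,
# so it adds no new information once sample_id is in the model.
batch_from_samples <- mm_full[, "sample_idCtrl_2"] + mm_full[, "sample_idTreat_2"]
print(all(mm_full[, "batch2"] == batch_from_samples))

# Dropping batch leaves a full-rank design; the sample terms absorb batch.
mm_nested <- model.matrix(~ group_id + sample_id, data = mdata2)
cat("without batch: columns =", ncol(mm_nested),
    "rank =", qr(mm_nested)$rank, "\n")
```

In this layout each sample belongs to exactly one batch, so the batch column is redundant whenever sample_id is in the formula. Whether `~ group_id + sample_id` is the appropriate model for the biological question is a separate matter that this check does not settle.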

