Question

differential gene expression analysis when not all samples have an untreated counterpart

0

Entering edit mode

3 months ago

nhaus ▴ 360

I have ~1000 RNAseq samples that come from 100 donors and am using edgeR to analyse it. The tissue from the 100 donors was treated with either 9 different chemicals (A, B, C, ...) or not treated at all (control).

Unfortunatly, due to technical reasons, I had to remove some of the control condition (only 3) due to a low number of reads (<5e6). This means that I have some "unpaired" samples, i.e. they received the treatment but I have no control for them. Is it correct, that I can still formulate a design model like this, if I am just interested in the average difference between for example treatment5 and the control?

#my design (for simplicity all batch effects are ignored)
design = ~ patient.id + treatment

A related question is the following. Because I have so many samples, I am using sva to adjust for batch effects. Is it correct, that after using this approach, I can remove the patient.id from my design because the identified surrogate variables account for patient-specific effects.

#my design after SVA
design = ~ SV1 + SV2 + SV3 +SV4+ SV5 ... + SV12 + treatment

Any input is very much appreciated!

Cheers

differential-expression edgeR • 533 views

ADD COMMENT • link 3 months ago by nhaus ▴ 360

score 2 · Accepted Answer · 2024-04-17

2

Entering edit mode

3 months ago

Gordon Smyth ★ 7.3k

I had to remove some of the control condition (only 3) due to a low number of reads (<5e6)

edgeR doesn't have any trouble with low numbers of reads and 5e6 is certainly no problem. If the smaller library sizes are associated with some other quality issue then, fine, remove the samples, but don't remove the samples just because of the low numbers.

I can still formulate a design model like this, if I am just interested in the average difference between for example treatment5 and the control?

Yes. The three donors with no control won't contribute to the results but the analysis is still correct.

after using this approach, I can remove the patient.id from my design because the identified surrogate variables account for patient-specific effects

Impossible to say. It depends on your data, particularly on how the donors are associated with batches. Normally I would prefer to keep donor in the model.

ADD COMMENT • link 3 months ago by Gordon Smyth ★ 7.3k

0

Entering edit mode

Thank you very much for your answer, that was very helpful!

Impossible to say. It depends on your data, particularly on how the donors are associated with batches. Normally I would prefer to keep donor in the model.

I did some more investigation of my data and totally agree with you. For me, SVA was unable to "pick up" the donor effect (i.e. still siginificant association between donor and PC1-5). To account for this, my idea was to include this in the model that I give to sva like so:

mod <- model.matrix(~donor+group, colData(dds))
mod0 <- model.matrix(~donor, colData(dds))
svseq <- svaseq(norm_counts, mod, mod0, n.sv = 5)

Do you by chance know if this is the correct approach or should mod0 look like this: model.matrix(~1, colData(dds)).

Furthermore, I downstream I am using limma::removeBatchEffect for visualizations. Is the following usage in this case correct given my SVA code:

mod <- model.matrix(~group, tmp_mdata)
formula <- as.formula(paste("~", paste0("SV", 1:5, collapse = " + ")))
mod0 <- model.matrix(formula, tmp_mdata)[, -1]
# adjust the counts using limma
adjusted_counts <- limma::removeBatchEffect(norm_data,
        batch = mdata$donor,
        covariates = mod0,
        design = mod
    )

Any help is much appreciated!

ADD REPLY • link 3 months ago by nhaus ▴ 360