differential gene expression analysis when not all samples have an untreated counterpart
1
0
Entering edit mode
10 weeks ago
nhaus ▴ 360

I have ~1000 RNAseq samples that come from 100 donors and am using edgeR to analyse it. The tissue from the 100 donors was treated with either 9 different chemicals (A, B, C, ...) or not treated at all (control).

Unfortunatly, due to technical reasons, I had to remove some of the control condition (only 3) due to a low number of reads (<5e6). This means that I have some "unpaired" samples, i.e. they received the treatment but I have no control for them. Is it correct, that I can still formulate a design model like this, if I am just interested in the average difference between for example treatment5 and the control?

#my design (for simplicity all batch effects are ignored)
design = ~ patient.id + treatment

A related question is the following. Because I have so many samples, I am using sva to adjust for batch effects. Is it correct, that after using this approach, I can remove the patient.id from my design because the identified surrogate variables account for patient-specific effects.

#my design after SVA
design = ~ SV1 + SV2 + SV3 +SV4+ SV5 ... + SV12 + treatment

Any input is very much appreciated!

Cheers

differential-expression edgeR • 492 views
ADD COMMENT
2
Entering edit mode
9 weeks ago
Gordon Smyth ★ 7.2k

I had to remove some of the control condition (only 3) due to a low number of reads (<5e6)

edgeR doesn't have any trouble with low numbers of reads and 5e6 is certainly no problem. If the smaller library sizes are associated with some other quality issue then, fine, remove the samples, but don't remove the samples just because of the low numbers.

I can still formulate a design model like this, if I am just interested in the average difference between for example treatment5 and the control?

Yes. The three donors with no control won't contribute to the results but the analysis is still correct.

after using this approach, I can remove the patient.id from my design because the identified surrogate variables account for patient-specific effects

Impossible to say. It depends on your data, particularly on how the donors are associated with batches. Normally I would prefer to keep donor in the model.

ADD COMMENT
0
Entering edit mode

Thank you very much for your answer, that was very helpful!

Impossible to say. It depends on your data, particularly on how the donors are associated with batches. Normally I would prefer to keep donor in the model.

I did some more investigation of my data and totally agree with you. For me, SVA was unable to "pick up" the donor effect (i.e. still siginificant association between donor and PC1-5). To account for this, my idea was to include this in the model that I give to sva like so:

mod <- model.matrix(~donor+group, colData(dds))
mod0 <- model.matrix(~donor, colData(dds))
svseq <- svaseq(norm_counts, mod, mod0, n.sv = 5)

Do you by chance know if this is the correct approach or should mod0 look like this: model.matrix(~1, colData(dds)).

Furthermore, I downstream I am using limma::removeBatchEffect for visualizations. Is the following usage in this case correct given my SVA code:

mod <- model.matrix(~group, tmp_mdata)
formula <- as.formula(paste("~", paste0("SV", 1:5, collapse = " + ")))
mod0 <- model.matrix(formula, tmp_mdata)[, -1]
# adjust the counts using limma
adjusted_counts <- limma::removeBatchEffect(norm_data,
        batch = mdata$donor,
        covariates = mod0,
        design = mod
    )

Any help is much appreciated!

ADD REPLY

Login before adding your answer.

Traffic: 1414 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6