Experiment:

Question

Handling biological replicates across batches in DESeq2

2

Entering edit mode

5.5 years ago

Kristin Muench ▴ 620

EDIT:

Cross-posted on Bioconductor: https://support.bioconductor.org/p/114506/

Hello,

I am trying to figure out how to handle biological replicates across batches in DESeq2's design matrix. They're not quite technical replicates, but trying to include the sample ID as a factor in the design matrix produces a 'Model matrix not full rank' error. What do you think I should do?

Experiment:

We have three variables of interest in our analysis: Timepoint (A and B), Sex (M or F), and Condition (cond1 or cond2).

Note that a given sample can only have one timepoint, one sex, and one condition - so it's not like we can split the same tissue and look at it at Timepoint A and B. Each possible Timepoint x Sex x Condition has its own unique samples.

In our first sequencing batch, we collected samples for each possible combination of conditions.

In our second batch, we took some of the same RNA samples from the first sequencing batch PLUS some new RNA samples, re-generated libraries from all of these, and then sequenced.

In the end, we have a sample table that looks like this:

SAMPLE | BATCH | CONDITION | SEX | TIMEPOINT
------  -------  ---------   ----  ---------
samp1  |  1    |   cond1   |  F  |    A
samp1  |  2    |   cond1   |  F  |    A
samp2  |  1    |   cond1   |  M  |    A
samp2  |  2    |   cond1   |  M  |    A
samp3  |  1    |   cond1   |  F  |    A
samp4  |  2    |   cond1   |  F  |    A

...and so forth.

Problem

I tried to run DESeq with this:

design(dds) <- ~Genotype + Sex + Condition + SequencingBatch + Sample

and also with this:

design(dds) <- ~Genotype + Sex + Condition + SequencingBatch_Sample

...where SequencingBatch_Sample is a combination of SequencingBatch and Sample, like "samp1_1", samp1_2" for sample 1/batch1 and sample1/batch2.

These designs, of course, throw a 'Matrix not full rank' error, presumably because it is not possible to have the same sample have every possible combination of Condition/Sex/Timepoint, since each sample can only have one quality each of Condition, Sex, and Timepoint.

However, I'm not quite sure what to do instead.

Using collapseReplicates() doesn't feel appropriate because these aren't really technical replicates - a whole new library was produced for the second batch, just using the same RNA.

Doing nothing seems problematic because then the 'duplicated' signal from a subset of samples threatens to overwhelm the differential expression results.

All suggestions are very welcome - thank you!

deseq2 rna-seq R deseq RNA-Seq • 6.6k views

ADD COMMENT • link 5.5 years ago by Kristin Muench ▴ 620

3

Entering edit mode

I would ask this question at the Bioconductor support page. Mike Love (the developer of DESeq2) is outstandingly responsive, especially when dealing with well-written questions like yours. I asked something about shrinkage estimators a few hours ago and got an elaborate answer including code back just a few hours later.

ADD REPLY • link 5.5 years ago by ATpoint 82k

0

Entering edit mode

Ah thank you! I haven't interacted with Bioconductor forum much, I'll give it a try.

ADD REPLY • link 5.5 years ago by Kristin Muench ▴ 620

2

Entering edit mode

Please share his answer here once you get one. I would be interested in his expertise on this one, too.

ADD REPLY • link 5.5 years ago by ATpoint 82k

1

Entering edit mode

Hello - I posted his reply below, hope it helps

ADD REPLY • link 5.5 years ago by Kristin Muench ▴ 620

0

Entering edit mode

I'd start out by doing PCA on these samples. Maybe you will luck out, and observe that some of your categories don't make much difference. My guess is you will not be able to correct based on sample, since each sample is only in there once or twice.

ADD REPLY • link 5.5 years ago by swbarnes2 14k

0

Entering edit mode

Thanks for the suggestion! Unfortunately, it does seem like all of the aforementioned factors do separate within the first two PCs - except for Sex, which we know should absolutely have a large expression difference in all conditions at all timepoints. Replicate samples in non-batch-corrected data usually, but don't always, cluster together.

ADD REPLY • link 5.5 years ago by Kristin Muench ▴ 620

1

Entering edit mode

5.5 years ago

Friederike 8.9k

If I read your description correctly, you have a set of samples for which you have 2 batches and another set of samples for which you have only 1 batch? I'd drop "batch" from the picture, similar to what you tried with SequencingBatch_Sample, i.e. combine the info about batch and sample into one name, and then use ~Genotype + Sex + Condition

EDIT: This way, all the samples will serve as replicates of each other if they share the same genotype, sex, condition.

ADD COMMENT • link 5.5 years ago by Friederike 8.9k

0

Entering edit mode

Ah thank you! This may be the way to go. My concern with combining these factors is that it will make downstream analyses difficult because we are interested in the main effect of each (e.g. overall effect of Genotype, overall effect of Sex) as well as interaction effect (e.g. overall effect of Genotype:Sex). If we combined the three into a single factor, we'd have to content with doing many many pairwise analyses, e.g. Condition1+SexF+TimepointA vs. Condition1+SexF+TimepointB, Condition1+SexF+TimepointA vs. Condition1+SexM+TimepointA... etc. Perhaps this is the best solution, however :/

ADD REPLY • link 5.5 years ago by Kristin Muench ▴ 620

0

Entering edit mode

I did not suggest to combine Sex, Condition, Genotype. I suggested to collapse Batch & Sample.

ADD REPLY • link 5.5 years ago by Friederike 8.9k

0

Entering edit mode

~Genotype + Sex + Condition should allow you to query all of the contrasts you mentioned were of interest to you, i.e.

> sample_info
DataFrame with 10 rows and 3 columns
         condition         sex    genotype
       <character> <character> <character>
SNF2_1        SNF2        male         Het
SNF2_2        SNF2      female         Het
SNF2_3        SNF2        male         Hom
SNF2_4        SNF2      female         Hom
SNF2_5        SNF2        male         Het
WT_1            WT      female         Het
WT_2            WT        male         Het
WT_3            WT      female         Hom
WT_4            WT        male         Hom
WT_5            WT      female         Hom

## generate DESeq object
> DESeq.ds <- DESeqDataSetFromMatrix(countData = readcounts,
                              colData = sample_info,
                              design = ~ condition + sex + genotype)

## run DESeq analysis
> DESeq.ds <- DESeq(DESeq.ds)

> resultsNames(DESeq.ds)
[1] "Intercept"            "condition_WT_vs_SNF2" "sex_male_vs_female"   "genotype_Hom_vs_Het"

ADD REPLY • link 5.5 years ago by Friederike 8.9k

0

Entering edit mode

Thank you for clarifying. I wonder if there might still be an issue that individual samples are correlated to each other that might not be accounted for in the design/might still influence results? Mike Love suggested one strategy (post below) but I think I'll try this to see how much of a difference it makes. Thanks again

ADD REPLY • link 5.5 years ago by Kristin Muench ▴ 620

0

Entering edit mode

individual samples are correlated to each other that might not be accounted for in the design

Not sure what you mean by this.

Michael Love has explained the issue quite well. Either your technical replicates are very similar to each other --> in this case, merge them. I would probably do that, for example, if you see them cluster together very neatly in either hclust or PCA analyses.

If the replicates are not very similar to each other, I'm not sure how much harm it would do in a practical sense to pretend they are biological replicates (which is what my suggested approach effectively does). This you will find out if you compare Michael's approach with mine.

ADD REPLY • link 5.5 years ago by Friederike 8.9k

score 1 · Accepted Answer · 2018-10-29

Hello,

Back-propagating the answer that Mike Love suggested on Bioconductor forums:

If there is not a lot of technical variation across replicates > consider treating the samples like technical replicates, and collapse them in DESeq using the collapseReplicates() function.

If there is a lot of technical variation across replicates (as in my case) - "even though there is no biological variation that you would expect" -

"The only approach I can think of, to inform the model that you have technical replicates nested within condition which are not biological replicates, and yet also not similar enough to justify collapsing them, is to use the duplicateCorrelation() approach with limma-voom. So I'd take a look at that framework for approaching this experiment.

It's not possible to use fixed effects to account for the duplicate nature of technical replicates within a condition, and this is the way that DESeq2 controls for variables. You can do within-individual comparisons for various complex designs, but you can't "control" for the correlations among certain related samples within a condition and then compare directly across condition with our implementation. "