Batch correction using DESeq2
3.8 years ago
Raheleh ▴ 260

Hi all,

I have RNA-seq data (read counts) for 96 mouse primary tumors with 15 different genotypes. These 96 samples were sequenced on 10 different days; however, most samples with the same genotype were sequenced on the same day. I am afraid that if I batch-correct for sequencing day, I will also lose the biological differences that exist across genotypes. Any suggestions?

This is my script. After batch correction I see a large change in the PCA plot:

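library(DESeq2)

dds <- DESeqDataSetFromMatrix(as.matrix(all), colData, design = ~ Batch)

# Variance-stabilize, then plot the PCA colored by batch
vsd <- vst(dds, blind = FALSE)
plotPCA(vsd, intgroup = "Batch")

# Remove the batch effect from the transformed values and plot again
assay(vsd) <- limma::removeBatchEffect(assay(vsd), batch = vsd$Batch)
plotPCA(vsd, intgroup = "Batch")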

Part of colData:

   Genotype condition      Batch
1         A   primary 2017-06-29
2         A   primary 2017-06-29
3         A   primary 2017-06-29
4         A   primary 2017-06-29
5         A   primary 2017-06-29
6        AK   primary 2017-11-09
7        AK   primary 2017-11-09
8        AK   primary 2017-11-09
9        AP   primary 2018-04-18
10       AP   primary 2018-04-18
11       AP   primary 2018-04-18
12      AKP   primary 2019-09-12
13      AKP   primary 2019-09-12
14      AKP   primary 2019-09-12

I also looked at these questions:

Batch correction in DESeq2

DESeq2, batch effect correction, multiple conditions

Batch effect problem DEG, DESseq2

But I am still not sure what I should do. I would really appreciate any help!

RNA-Seq deseq2 batch vst • 1.3k views
3.8 years ago

If all your samples are primary, that column doesn't belong in the colData. Just drop it.

The dates you have given are completely confounded with your genotype, so you have to drop them too. If they really represent sequencing dates, then they aren't adding any technical artifacts. If they represent the day of RNA extraction or the day of library prep, then you are in deep trouble, because those do affect RNA-seq results, and you will have no way of knowing which changes are due to tumor type and which are due to prep date for tumor types with different dates.

You know your column headers don't have to literally be Condition and Genotype, right?


Many thanks swbarnes for your prompt reply! Yes, they are all primary tumors but with different genotypes.

Sorry, I don't get your question. I named the headers myself. What should they be?


You cannot make use of a column where every single sample has the same value. There is no point in it being there.

You cannot get rid of or account for batch effect in the dataset you posted, because it is deeply confounded with genotype. You can't make use of it, except as a guide to which genotype comparisons aren't confounded by batch, and which ones are.

However, if 1) All the RNA was extracted on the same day 2) All the libraries were prepped on the same day 3) the dates really are just the instrument run date, you can safely ignore that date, because running libraries on different days does not cause a batch effect.
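
One way to see the confounding described above is to cross-tabulate genotype against date. The following is only a minimal base-R sketch: `col_data` is a hypothetical toy sample sheet that copies the 14 rows shown in the question, not the full 96-sample table, so the real data may contain a few genotypes that span more than one date.

```r
# Toy sample sheet mirroring the colData excerpt in the question
# (hypothetical stand-in for the real 96-sample table)
col_data <- data.frame(
  Genotype = c(rep("A", 5), rep("AK", 3), rep("AP", 3), rep("AKP", 3)),
  Batch    = c(rep("2017-06-29", 5), rep("2017-11-09", 3),
               rep("2018-04-18", 3), rep("2019-09-12", 3)),
  stringsAsFactors = TRUE
)

# Cross-tabulate genotype against date; a single non-zero cell per row
# means that genotype was only ever processed on one date
tab <- table(col_data$Genotype, col_data$Batch)
print(tab)

# Number of distinct dates each genotype appears in
batches_per_genotype <- rowSums(tab > 0)
print(batches_per_genotype)

# With full confounding, a design containing both terms is not full rank,
# so the genotype and batch effects cannot be estimated separately
mm <- model.matrix(~ Genotype + Batch, data = col_data)
full_rank <- qr(mm)$rank == ncol(mm)
print(full_rank)
```

Genotypes that appear under more than one date in the real table are the ones for which a batch term could, in principle, be separated from the genotype effect.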


If I ignore that date, is it correct to add only genotype to the design formula to account for its effect? Is this script correct for normalizing the data?

dds <- DESeqDataSetFromMatrix(as.matrix(all), colData, design = ~ Genotype)
vsd <- vst(dds, blind = FALSE)
assay(vsd)

I really appreciate your time and help!


That command line doesn't normalize anything. Normalizing doesn't take your design into account at all. But ~ Genotype is the only design you should be using with that colData.
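
To illustrate why normalization is independent of the design, here is a rough base-R sketch of the median-of-ratios idea behind DESeq2's size factors. It is not DESeq2's own code: `median_of_ratios` is a hypothetical helper and the count matrix is made up.

```r
# Toy count matrix: 4 genes x 3 samples (made-up numbers)
counts <- matrix(
  c( 10,  20,  40,
    100, 200, 400,
      5,  10,  20,
     50, 100, 200),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:3))
)

# Hypothetical helper sketching median-of-ratios size factors.
# Note that it takes only the counts: no colData, no design formula.
median_of_ratios <- function(counts) {
  # Log geometric mean of each gene across samples
  log_geo_means <- rowMeans(log(counts))
  # Genes with a zero count in any sample get -Inf and are skipped
  usable <- is.finite(log_geo_means)
  apply(counts, 2, function(cnts) {
    exp(median((log(cnts) - log_geo_means)[usable & cnts > 0]))
  })
}

size_factors <- median_of_ratios(counts)

# Divide each sample (column) by its size factor
normalized <- sweep(counts, 2, size_factors, "/")
print(size_factors)
print(normalized)
```

Because the helper never sees the sample sheet, changing the design from ~ Batch to ~ Genotype would leave these size factors unchanged.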


Oh, the second line of my script was left out, sorry. I edited my post. Thanks!
