Batch and Sample correction for downstream analysis using DESeq2
NorbertK • 4 months ago

Hello everyone,

I am an absolute beginner in sequencing analysis and DESeq2, so please forgive the possibly mundane questions. I have tried to look up different methods but couldn't find a fitting answer yet.

I am currently working with sequencing data from an Illumina sequencer. The data are divided into groups by genotype and timepoint, and I am comparing groups one to one each time.

My first issue stems from the fact that the samples were run in two batches. Batch 1 was sequenced as single-end 75 bp, while Batch 2 was sequenced as paired-end 150 bp. I manually down-scaled the 150 bp paired-end data with Trimmomatic and BBMap to match my 20 million 75 bp single-end reads before using Salmon to generate quant files.

Question 1: Will this be an issue later on, and if so, are there better alternatives?

My second issue concerns the PCA results I get after running the comparisons with DESeq2 as

ddsTxi <- DESeqDataSetFromTximport(txi, colData = samples_subset, design = ~group)
dds <- DESeq(ddsTxi)   # fit the model before calling results()
res <- results(dds)
vsd <- vst(dds, blind = TRUE)
PCA <- plotPCA(vsd, intgroup = "group")

The resulting image shows a distinct batch effect between the two data sets. My first thought was to change the DESeq2 design to

design = ~group + batch

And while the volcano plot, which I also generate from res <- results(dds), changes, the PCA plot is unfazed. Another option I found was limma's removeBatchEffect(). I run this on the vsd object after DESeq2, and it definitely improves my PCA results.

The issue is that removeBatchEffect() only affects the vsd object that is passed to plotPCA. My normalized count table, my TPM tables, and the DESeq2 results output are unaffected by these changes.
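
(For reference, a minimal sketch of this visualization-only correction, assuming batch and group columns exist in colData; the design argument, documented in limma, keeps removeBatchEffect() from stripping the group signal along with the batch effect:)

mat <- assay(vsd)
# Protect the biological signal: the design matrix tells limma which
# effects to preserve while removing the batch term.
mm <- model.matrix(~ group, data = as.data.frame(colData(vsd)))
assay(vsd) <- limma::removeBatchEffect(mat, batch = vsd$batch, design = mm)
plotPCA(vsd, intgroup = "group")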

Presumably this causes problems for downstream GSEA, since the correction doesn't affect the underlying data?

Question 2: How can I apply batch correction so that my TPM table, normalized counts, and DESeq2 results table are all corrected consistently for downstream analysis?

Thank you everyone for your time

Relevant Code:

txi <- tximport(files, type = "salmon", ignoreTxVersion = TRUE, tx2gene = tx2gene)

all_TPM <- txi$abundance
colnames(all_TPM) <- samples$tag
all_counts <- txi$counts
colnames(all_counts) <- samples$tag
all_Log2TPM <- log2(all_TPM + 1)

ddsTxi <- DESeqDataSetFromTximport(txi, colData = samples_subset, design = ~group + batch)
ddsTxi$group <- factor(ddsTxi$group, levels = c(groupA, groupB))
dds <- DESeq(ddsTxi)   # fit the model before calling results()
res <- results(dds)
vsd <- vst(dds, blind = FALSE)
assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$batch)   # affects vsd only

PCA <- plotPCA(vsd, intgroup=c('group'))
   [...]   #Setting up all stuff for PCA output

volcano <- EnhancedVolcano(res, lab = rownames(res), x = 'log2FoldChange', y = 'pvalue')
   [...]   #Setting up stuff for Volcano output

normCounts <- as.data.frame(counts(dds, normalized = TRUE))
   [...]   #Output normCount csv

resdf <- as.data.frame(res)
resdf$padj[is.na(resdf$padj)] <- 1
   [...]   #Output DESeq results csv
Tags: Batch-Correction • normCounts • DESeq2 • Downscaling
Comment:

Is the batch only the sequencing mode, or were the libraries prepared at different timepoints/in different batches? If it's only the sequencing itself, then treating the paired-end samples as single-end (using only R1) and trimming everything to the same read length should eliminate any batch effect.

Reply:

Sadly, the second batch is an entirely new library prep, as the samples were mismatched and had to be fully redone. As such, trimming and treating the reads as single-end does not eliminate the batch effect, it just reduces it by one stage.

Answer • 4 months ago:

Are your groupings shared across batches? That is, do you have samples from all comparisons of interest in both batches?

If so, ComBat-seq can be effective in removing the effect, though you have to be careful with such methods, which actually adjust the counts rather than modeling the effect. Modeling it (as you are doing) is generally safer, but adjusting the counts directly can be useful for visualization (like the PCA or heatmaps). And DESeq2 can be run directly on the output, allowing for simple comparisons between the methods, as in the sketch below.
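
A minimal sketch of the two routes side by side, reusing the txi and samples_subset objects from the question; the batch column name is an assumption (the later code uses run):

# Route 1: keep raw counts, model the batch in the design (safer for inference).
dds1 <- DESeq(DESeqDataSetFromTximport(txi, colData = samples_subset,
                                       design = ~ batch + group))
res1 <- results(dds1)

# Route 2: adjust the counts with ComBat-seq, then run DESeq2 on the output.
# Salmon/tximport counts are estimated and may be non-integer, hence round().
adj <- sva::ComBat_seq(round(txi$counts),
                       batch = samples_subset$batch,
                       group = samples_subset$group)
dds2 <- DESeq(DESeqDataSetFromMatrix(round(adj), colData = samples_subset,
                                     design = ~ group))
res2 <- results(dds2)

# Rough sanity comparison: do the two routes agree on fold-change direction?
table(sign(res1$log2FoldChange) == sign(res2$log2FoldChange))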

Reply:

Thanks for your input. My groupings are indeed shared across the batches, although the overlap can be as little as 1 out of 4 samples or as much as 3 out of 4.

I have read up on ComBat-seq and it did look compelling; my issue was the warnings, as you say, about directly altering the counts.

Do you think it is possible to safely use ComBat-seq before DESeq2 to adjust the counts and still get reasonable data for downstream GSEA, or would that undermine any significant result in further analysis?

Reply:

It's really not possible for us to guess. You just have to try it out, compare the results, and validate things experimentally once you have testable findings.

The results accounting for the batch in the model design are appropriate for downstream GSEA if you're using the pre-ranked method (see the sketch below).
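
For illustration, a minimal pre-ranked sketch with fgsea, assuming res is the DESeq2 results object from the batch-aware design; the gene sets (pathways_list) are user-supplied:

library(fgsea)

# Rank genes by the Wald statistic from the batch-aware model.
ranks <- setNames(res$stat, rownames(res))
ranks <- sort(ranks[!is.na(ranks)], decreasing = TRUE)

# 'pathways_list' is a named list of gene sets (user-supplied).
# fgsea_res <- fgsea(pathways = pathways_list, stats = ranks)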

Reply:

So I just implemented ComBat-seq, and sadly I ran straight into some issues.

When I run ComBat-seq on the count matrices, it aborts as soon as it hits a group in which one batch has only 1 sample, so I am guessing that rules ComBat-seq out anyway.

On the other hand, when I feed the output from ComBat_seq to DESeqDataSetFromMatrix, it throws the error "some values in assay are not integers".

I assume it doesn't like being given colData the same way as DESeqDataSetFromTximport?

Reply:

DESeqDataSetFromMatrix needing integers can be solved by simply rounding the ComBat_seq output to integers, for example:
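
# Assuming 'adjusted' holds the ComBat_seq output matrix and
# samples_subset is the matching colData:
adjusted_int <- round(adjusted)
dds <- DESeqDataSetFromMatrix(countData = adjusted_int,
                              colData = samples_subset,
                              design = ~ group)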

Reply:

ComBat-seq, if I recall correctly, should in fact preserve the integer nature of the data; that's the whole point, so something is going wrong here.

When I run ComBat on the count matrices

You should run it on a single matrix that contains all data from all batches, and then provide the batch information to the function.
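
That is, something along these lines (names illustrative):

# One combined matrix across all batches; the batch vector is aligned
# to the matrix columns.
adj <- sva::ComBat_seq(counts_all, batch = batch_all, group = group_all)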

I am assuming it doesn't like me giving it coldata the same way as DESeqDataSetFromTximport?

It gives you a matrix of corrected counts that are still "raw" and integers. Please show code.

Reply:

I am terribly sorry if I overlooked something while rewriting the code. I am not a (bio)informatician by trade and am self-taught. Your help is much appreciated.

The code:

samples_subset <- subset(samples, group == groupA | group == groupB)
samples_subset$group <- factor(samples_subset$group, levels = c(groupA, groupB))

files <- file.path(data_dir, samples_subset$sample_name, "quant.sf")
names(files) <- samples_subset$sample_name
all(file.exists(files))

txi <- tximport(files, type = "salmon", ignoreTxVersion = TRUE, tx2gene = tx2gene)

combat_batch <- samples_subset$run
combat_group <- samples_subset$group
# Note: txi$counts from Salmon are estimated counts and need not be integers.
txi_subset_adjusted <- sva::ComBat_seq(as.matrix(txi$counts), batch = combat_batch, group = combat_group)

ddsTxi <- DESeqDataSetFromMatrix(countData = txi_subset_adjusted,
                                 colData = samples_subset,
                                 design = ~ group)

ddsTxi$group <- factor(ddsTxi$group, levels = c(groupA, groupB)) # factor level determines the reference (groupA)

dds <- DESeq(ddsTxi)
Reply:

So rounding actually did the trick. Sadly, I am still stuck on my issue with comparison groups where one batch contains only 1 sample, which causes ComBat-seq to abort the script.

Is there any way to circumvent this without actual additional samples?

Reply:

No, not that I know of.
