Alternative/Incorrect DiffBind outputs for each contrast when providing entire vs. partial datasheet
1
0
Entering edit mode
6 months ago
gkunz ▴ 10

Hello,

I am attempting to use DiffBind to identify differential peaks between groups in my experiment. I have 10 groups in my experiment with 3 replicates per group. This allows for the possibility for 45 unique comparisons. Through the help of the DIffBind vignettes and online resources I have been able to produce code that is functional and returns a list of differential peaks for each comparison that I have established.

However, issues arose when I began to visualize this information in IGV. The differential peaks that were identified by EDGER and DESEQ2 were not supported by the individual peak files. Differential peaks were being called in locations where no peak was identified in any of the replicates in the first place and differential peaks were being called at locations where all replicated are demonstrating peaks with similar scores. I am using the same fasta file for visualization that was used for alignment and peak calling.

In the hopes of solving this issue, I attempted to look closely at only one of my comparisons. I created a data sheet that only contained the necessary information (7.bams [including input], 6 .narrowPeaks) for that comparison and reran the same code without making any notable modifications. This both reduced the number of differential peaks identified by both EDGER and DESEQ2 and lead to the identification of different differential peaks than when I had run the code such that it utilized a datasheet containing all of my sample information. When I visualize this information in IGV, it is appropriate/correct. There are differential peak calls at locations in which there are actual peaks identified in one group but not the other and so on and so forth...

I would like to better understand why this is occurring and if it is to be expected. My initial though would be that providing .bams and .narowPeak files for other groups should not change the output of each contrast, but that does not seem to be the case. Is it the result of a normalization step? Is there something that I can do to make the code that runs all 45 comparisons return the appropriate differential peaks that I was able to obtain by only looking at a single comparison. I would like to avoid having to run all 45 comparisons individually if at all possible.

Recap:

• Samplesheet containing information for single contrast yields appropriate differential peaks
• Samplesheet containing information for all contrasts alters the differential peaks identified for each individual contract such that they appear incorrect upon visualization

I hope that this explanation makes sense. If further detail would be helpful, I am obviously more than happy to provide. Any assistance in addressing this question would be greatly appreciated as I would really prefer to not run each comparison individually.

All the Best!

ChIP-Seq software error DiffBind • 281 views
1
Entering edit mode
6 months ago

We are unlikely to be able to help without seeing your code. My first guess is that you're not providing anything to the peaks argument of the dba.count function. This will result in your peaks from all samples being merged into a consensus peak set that will be used for all comparisons. In your case, this leads to peaks that aren't actually present in your contrast of interest to get labeled as differential due to what's basically random noise.

You can overcome this by either generating your own consensus peakset for each comparison by merging the peaks via bedtools and feeding that to the peaks argument or try to use dba.peakset to add them to your DBA object directly. The function documentation for dba.peakset seems rather confusing to me, but you can probably figure it out with some experimentation.

1
Entering edit mode

Hi Jared,

Thanks for taking the time to respond!

Below is the code I am using to generate differential peaks. It is basically just what is outlined in the DiffBind vignette. You were correct to assume I was not providing anything to the peaks argument of the dba.count function and this does seem to be a sensible explanation for the error I am experiencing. I went ahead and read through the portion of the DiffBind Vignette that focuses on the generation of a consensus peakset and how to use the dba.peakset to generate consensus peaksets for each condition as opposed to a single consensus set. I followed those instructions and introduced the changes into my code below as well. This seems to have essentially fixed my problem. The results are not identical to if I would have run the code for each contrast on its own but, but all of the differential peaks that have been identified by DESEQ2 are at least sensible in that they seem to be supported by the at least two replicates Interestingly, the number of differential peaks identified by EDGER for every contrast far outnumber those identified by DESEQ, although many do overlap. EDGER did still identify some peaks that do not appear sensible and there was some mismatch in the differential peaks identified by DESEQ2 for the individual contrast vs. the entire set of contrasts with provide peak files.

(linking and .img did not seem to work)

It seems that the consensus set is generated from all peaks that appears in more than two of the provided samples. The use of this consensus peakset seemed to introduce some issues into the differential peaks identified when providing DiffBind with more than two conditions. (This is just my interpretation of the problem and may not be totally or even partially correct). Specifying consensus peaksets with the dba.peakset functions seemed to fix these problems, but still lead to different identification than if the individual contrast was run with no consensus peak set specified.

This brings about the question of is there a best method or most appropriate method for designating a consensus peakset? The goal obviously being to get the most reliable set of differential peaks. I am happy in that when I now visualize the peaks they seem appropriate, but how can I be sure that I am not missing peaks that could be classified as significantly differential based on the method used to establish a consensus peakset? This last question is likely more of discussion than a correct or incorrect answer, but I may be wrong about that.

Again, thanks for the assistance Jared, it is greatly appreciated!

suppressMessages(library(tidyverse))
suppressMessages(library(DiffBind))
suppressMessages(library(ChIPQC))
suppressMessages(library(BiocParallel))

setwd("~/H3K27ac Analysis")
samples

dbObj <- dba(sampleSheet=samples)

dbObj_consensus <- dba.peakset(dbObj, consensus = DBA_CONDITION, minOverlap=0.66)
dbObj_consensus <- dba(dbObj_consensus, mask = dbObj_consensus$masks$Consensus, minOverlap=1)
consensus_peaks <- dba.peakset(dbObj_consensus, bRetrieve=TRUE)

dbObj_dba.count <- dba.count(dbObj, bParallel = FALSE)
dbObj_dba.count

dbObj_dba.contrast <- dba.contrast(dbObj_dba.count, categories=DBA_CONDITION, minMembers = 2)
dbObj_dba.contrast

dbObj_dba.analyze <- dba.analyze(dbObj_dba.contrast, method = DBA_ALL_METHODS)
dbObj_dba.analyze

1
Entering edit mode

Good to hear about the progress. It's tricky to determine a good consensus peakset. Section 6.1 in the DiffBind vignette shows how to take a look at the proportion of overlapping peaks between samples, which can be helpful. It's going to depend on your data, how similar the samples are, and how much noise you're willing to deal with vs the risk of losing true positives.

This is one of the drawbacks of using peaks in general, and there are other approaches that avoid them by using a sliding window approach for comparison, like csaw.