How to properly organize a DiffBind analysis to normalize data across conditions (ChIP-seq)
1
0
Entering edit mode
4.9 years ago
msimmer92 ▴ 300

Hello, I was wondering how the DiffBind normalization works when you have different conditions, and what's the proper way to organize the analysis according to that, since I noticed certain differences (I'll illustrate with an example).

Imagine I have the following groups of replicates (ChIP-seq data, humans): - Studying ChIP-seq of certain factor in a human disease -

Samples from Cell type A (with e.g. 9 replicates for each condition A1, A2... AH).

  • A1 - stage1 of disease
  • A2 - stage2 of disease
  • A3 - stage3 of disease
  • AH - healthy indiviauals (as control condition)

Samples from Cell type B (as well, each of them with replicates)

  • B1 - stage1 of disease
  • B2 - stage2 of disease
  • B3 - stage3 of disease
  • BH - healthy individuals

Not only we have disease-control but also we have different cell types (so one more level of comparison).

There are two ways to do this analysis:

  • I can run my diffbind script two times (one for samples A and another one for samples B), or
  • I can concatenate the samples and make only one run with all together (the contrast can be done anyways pairwise between A1 and AH, A2, AH... etc).

I was wondering if one way is better than the other because, since it's data from the same disease, I would like to compare between cell types A and B. For this, they have to be normalized properly. Even though I select the same normalization in both runs (the default in this case, DBA_SCORE_TMM_MINUS_FULL) maybe when I run all together the normalization takes into account things that the separate runs don't. I did both ways for an example of mine and I noticed a slight difference in one of the stages, hence I wondered which is the difference and which one is more correct.

Does someone have a deeper understanding of the inner workings of this package and can answer this? Thank you !

diffbind R ChIP-Seq • 2.5k views
ADD COMMENT
1
Entering edit mode

If you are only interested to compare within A and within B I would go with option 1. The assumption behind the DESeq2 normalization factors (which is what you probably use) is that most regions are not differentially bound. You can check with MA plots to get an idea. I have a simple wrapper script for MA-plots if you are interedted. Given the assumption is true and you only want to compare within one cell type, adding the second to the calculations probably adds unnecessary variation. You can use PCA or MDS to see how the samples cluster. I guess the main confounding factor will be cell type followed by disease/healthy.

ADD REPLY
0
Entering edit mode

Thank you for the script, I will check it out ! But what about if I am interested in comparing A vs B as well? (this is my main concern, that I can compare A and B being sure all A and B samples are properly normalized). For example, make MA and volcano plots for the pairwise comparisons of conditions within A and B and then show everything and say things like "in cell type A we can see there's a lot of change and certain trend , but in B we don't see it". Things like that, that I want to be very sure there's no bias (introduced by how I run the scripts, as I told you) before concluding.

ADD REPLY
1
Entering edit mode

I would in this case see how the MA plots look when you compare the averaged samples (so replicates of one group vs replicates of the other group). If you assume that most regions do not change and normalization went properly you should see the bulk of datapoints being somewhat spread across the x-axis at y~0. You can have a look at the csaw vignette which has a section about normalization and is pretty informative.

ADD REPLY
0
Entering edit mode

(sorry for the bunch of questions, I don't know if I should make another post about only this?) I also have a question about interpreting the pairwise MA plot that's produced by diffbind (dba.plotMA()). I'm new to this type of plots, so I showed it to a mathematician that I know. He said he would expect that, if normalization went correctly, the blue cloud (density plot at the background) is symmetrical around the X axis, whereas in the diffbind examples and in my dataset is asymmetrical (and you can even draw some conclusions if it's more to one or the other side - I mean, it's even informative). He said this but we both acknowledge that the MA plot in diffbind might be a variant, done in such a way that is not necessarily wrong because of that. He also acknowledged that he doesn't have experience with this package. Because I am not an expert in the inner workings of this specific MA plot, I was wondering if this is true and applies to the dba.MAplots (the thing about the symmetry that we expect, if normalization is correct).

ADD REPLY
0
Entering edit mode

Another question: yes, I use DEseq2. Do you recommend it or EdgeR?

ADD REPLY
1
Entering edit mode

I think it could be easier to discuss if you join the biostars slack biostar.slack.com: Chat for the biostars community. We have a help section there. I personally use csaw which uses edgeR internally but both are well-maintained tools. It is probably a matter of personal preference.

ADD REPLY
1
Entering edit mode
4.9 years ago
ATpoint 81k

Towards the MA-plot question: An MA plot shows on the x-axis the average counts of the two conditions you compare and on the y-axis the log2FC. Here is an example from RNA-seq where one of the samples has overall lower counts than the other (simply because fewer total read count). It is a cell-line replicate.

The basic code is simple:

M <- log2(Sample1/Sample2)         ## y-axis
A <- 0.5*log2(Sample1*Sample2)     ## x-axis
plot(A,M)
abline(h=0)

enter image description here

Given that most genes/regions/peaks do not change, the bulk of data points will scattered around be at y~0 along the x-axis. In this case the TMM-normalization (from edgeR) correctly scaled the two datasets to be centered at y ~ 0. You see that genes with low counts (so on the left of the plot) always have higher fold cahnges than high-count genes. This is called mean-variance dependency. These FCs are typically artifacts based on the low counts. This issustrates why it is so important to have replicates and proper statistics to get rid of these false enrichments which typically are not significant.

ADD COMMENT

Login before adding your answer.

Traffic: 1850 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6