Question

How to properly organize a DiffBind analysis to normalize data across conditions (ChIP-seq)

0

Entering edit mode

4.9 years ago

msimmer92 ▴ 300

Hello, I was wondering how the DiffBind normalization works when you have different conditions, and what's the proper way to organize the analysis according to that, since I noticed certain differences (I'll illustrate with an example).

Imagine I have the following groups of replicates (ChIP-seq data, humans): - Studying ChIP-seq of certain factor in a human disease -

Samples from Cell type A (with e.g. 9 replicates for each condition A1, A2... AH).

A1 - stage1 of disease
A2 - stage2 of disease
A3 - stage3 of disease
AH - healthy indiviauals (as control condition)

Samples from Cell type B (as well, each of them with replicates)

B1 - stage1 of disease
B2 - stage2 of disease
B3 - stage3 of disease
BH - healthy individuals

Not only we have disease-control but also we have different cell types (so one more level of comparison).

There are two ways to do this analysis:

I can run my diffbind script two times (one for samples A and another one for samples B), or
I can concatenate the samples and make only one run with all together (the contrast can be done anyways pairwise between A1 and AH, A2, AH... etc).

I was wondering if one way is better than the other because, since it's data from the same disease, I would like to compare between cell types A and B. For this, they have to be normalized properly. Even though I select the same normalization in both runs (the default in this case, DBA_SCORE_TMM_MINUS_FULL) maybe when I run all together the normalization takes into account things that the separate runs don't. I did both ways for an example of mine and I noticed a slight difference in one of the stages, hence I wondered which is the difference and which one is more correct.

Does someone have a deeper understanding of the inner workings of this package and can answer this? Thank you !

diffbind R ChIP-Seq • 2.5k views

ADD COMMENT • link updated 4.9 years ago by ATpoint 81k • written 4.9 years ago by msimmer92 ▴ 300

1

Entering edit mode

If you are only interested to compare within A and within B I would go with option 1. The assumption behind the DESeq2 normalization factors (which is what you probably use) is that most regions are not differentially bound. You can check with MA plots to get an idea. I have a simple wrapper script for MA-plots if you are interedted. Given the assumption is true and you only want to compare within one cell type, adding the second to the calculations probably adds unnecessary variation. You can use PCA or MDS to see how the samples cluster. I guess the main confounding factor will be cell type followed by disease/healthy.

ADD REPLY • link 4.9 years ago by ATpoint 81k

0

Entering edit mode

Thank you for the script, I will check it out ! But what about if I am interested in comparing A vs B as well? (this is my main concern, that I can compare A and B being sure all A and B samples are properly normalized). For example, make MA and volcano plots for the pairwise comparisons of conditions within A and B and then show everything and say things like "in cell type A we can see there's a lot of change and certain trend , but in B we don't see it". Things like that, that I want to be very sure there's no bias (introduced by how I run the scripts, as I told you) before concluding.

ADD REPLY • link 4.9 years ago by msimmer92 ▴ 300

1

Entering edit mode

I would in this case see how the MA plots look when you compare the averaged samples (so replicates of one group vs replicates of the other group). If you assume that most regions do not change and normalization went properly you should see the bulk of datapoints being somewhat spread across the x-axis at y~0. You can have a look at the csaw vignette which has a section about normalization and is pretty informative.

ADD REPLY • link 4.9 years ago by ATpoint 81k

0

Entering edit mode

(sorry for the bunch of questions, I don't know if I should make another post about only this?) I also have a question about interpreting the pairwise MA plot that's produced by diffbind (dba.plotMA()). I'm new to this type of plots, so I showed it to a mathematician that I know. He said he would expect that, if normalization went correctly, the blue cloud (density plot at the background) is symmetrical around the X axis, whereas in the diffbind examples and in my dataset is asymmetrical (and you can even draw some conclusions if it's more to one or the other side - I mean, it's even informative). He said this but we both acknowledge that the MA plot in diffbind might be a variant, done in such a way that is not necessarily wrong because of that. He also acknowledged that he doesn't have experience with this package. Because I am not an expert in the inner workings of this specific MA plot, I was wondering if this is true and applies to the dba.MAplots (the thing about the symmetry that we expect, if normalization is correct).

ADD REPLY • link 4.9 years ago by msimmer92 ▴ 300

0

Entering edit mode

Another question: yes, I use DEseq2. Do you recommend it or EdgeR?

ADD REPLY • link 4.9 years ago by msimmer92 ▴ 300

1

Entering edit mode

I think it could be easier to discuss if you join the biostars slack biostar.slack.com: Chat for the biostars community. We have a help section there. I personally use csaw which uses edgeR internally but both are well-maintained tools. It is probably a matter of personal preference.

ADD REPLY • link 4.9 years ago by ATpoint 81k

score 1 · Answer 1 · 2019-05-27

Towards the MA-plot question: An MA plot shows on the x-axis the average counts of the two conditions you compare and on the y-axis the log2FC. Here is an example from RNA-seq where one of the samples has overall lower counts than the other (simply because fewer total read count). It is a cell-line replicate.

The basic code is simple:

M <- log2(Sample1/Sample2)         ## y-axis
A <- 0.5*log2(Sample1*Sample2)     ## x-axis
plot(A,M)
abline(h=0)

enter image description here

Given that most genes/regions/peaks do not change, the bulk of data points will scattered around be at y~0 along the x-axis. In this case the TMM-normalization (from edgeR) correctly scaled the two datasets to be centered at y ~ 0. You see that genes with low counts (so on the left of the plot) always have higher fold cahnges than high-count genes. This is called mean-variance dependency. These FCs are typically artifacts based on the low counts. This issustrates why it is so important to have replicates and proper statistics to get rid of these false enrichments which typically are not significant.