Question: Difference between DiffBind and DESeq2 library size normalisation
8 weeks ago by
bioinfouser wrote:

For ChIP-seq analysis I have been using DiffBind, but I want to switch to DESeq2 because I need to control for multiple covariates, which the DiffBind package does not currently offer AFAIK (only one blocking factor at a time, even if I concatenate the blocking factors). I have raw counts generated from the BAMs over the MACS2 peaks, and I can run the full analysis. However, I have a question about library size normalisation:

DiffBind normalises by library size by default, and the DESeq2 vignette suggests that DESeq2 does too. The difference I see is that DiffBind takes the library size from the BAM files, which is probably the total number of mapped reads. DESeq2 does not have the BAMs, so it probably uses the column-wise sums of the sample read counts as the library size. Those sums would only include reads in the regions detected by MACS2, not the whole BAM file, right? Fundamentally, will the two be different or not? I can imagine there are reads in the BAM files that fall outside the MACS2 peaks, so they will not appear in the counts generated by, say, featureCounts. I would really appreciate it if the community could comment on this!

Also, when DiffBind does this default normalisation (bFullLibrarySize = TRUE) and then invokes DESeq2 for the differential analysis, DESeq2 also does its own normalisation. So when someone uses the DiffBind package, does the count matrix get normalised by library size twice? Once from the BAM read counts (DiffBind) and again from the total counts (DESeq2)?

My main point is: can I trust the DESeq2 library size normalisation as opposed to the DiffBind way of normalising by library size, and use DESeq2 alone to analyse my data instead of DiffBind?

One possible solution could be to feed the total mapped read numbers into DESeq2 as an extra column, keep it as a continuous variable, and include that column in the design matrix. Does that sound logical? Has anyone done it this way?

Thank you again for taking the time to read my post! Stay safe!

8 weeks ago by
Rory Stark (University of Cambridge, Cancer Research UK - Cambridge Institute) wrote:

When bFullLibrarySize = TRUE, DiffBind bypasses the DESeq2 normalization and performs a simple normalization based on the relative number of reads in each of the BAM files. The counts are not "re-normalized" a second time when DESeq2 is invoked.
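
To make the two notions of "library size" from the question concrete, here is a minimal sketch in plain Python. It is not DiffBind or DESeq2 code: the counts, the BAM read totals, and the helper name `relative_factors` are all made up for illustration, and the exact scaling DiffBind applies internally may differ.

```python
# Hypothetical toy data: rows are peaks, columns are samples.
peak_counts = [
    [100, 200],
    [50, 100],
    [30, 60],
]
# Hypothetical total mapped reads per BAM file (assumed values).
bam_totals = [1_000_000, 1_000_000]


def relative_factors(sizes):
    """Scale each library size by the mean, so the factors average to 1."""
    mean_size = sum(sizes) / len(sizes)
    return [s / mean_size for s in sizes]


# Library size taken as reads in peaks only (column sums of the matrix).
reads_in_peaks = [sum(col) for col in zip(*peak_counts)]

full_library_factors = relative_factors(bam_totals)
peak_sum_factors = relative_factors(reads_in_peaks)

print(reads_in_peaks)
print(full_library_factors)
print(peak_sum_factors)
```

In this toy case both samples were sequenced to the same depth, so the full-library factors are both 1.0 and the twofold difference in peak counts is left intact. The factors based on reads in peaks come out at roughly 0.67 and 1.33, which would scale that difference away. Which behaviour is appropriate depends on whether a global change in binding is expected.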


Dear Rory,

Thank you very much for your swift reply! I understand completely now and will do my analysis accordingly. One last question regarding normalisation: if I use DESeq2 normalisation for ChIP-seq analysis, would it give results equivalent to DiffBind's simple normalisation?

Also, as far as I know DiffBind cannot use multiple blocking factors, and I think you suggested (I cannot find the post now, sorry!) using other software (such as DESeq2 directly) to model the covariates of complex experimental designs and do the differential binding analysis. But I really love using DiffBind and I think it is a fantastic package for ChIP-seq analysis, like a Swiss Army knife! Will a future update of DiffBind include functions for modelling complex experimental designs?

modified 8 weeks ago • written 8 weeks ago by bioinfouser

Dear Rory, one more question: is this default library size normalisation in DiffBind done on the raw counts or not?

written 6 weeks ago by bioinfouser

By default, the counts that are normalized are adjusted as follows:

 max(chip_counts - control_counts,1)
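
As a small illustration of that adjustment, the following Python sketch applies the formula to each peak in one sample. The function name `adjust_counts` and the numbers are hypothetical; only the `max(chip_counts - control_counts, 1)` rule comes from the reply above.

```python
def adjust_counts(chip_counts, control_counts):
    """Subtract control reads from ChIP reads per peak, with a floor of 1."""
    return [max(chip - ctrl, 1) for chip, ctrl in zip(chip_counts, control_counts)]


# Hypothetical read counts for four peaks in a single sample.
chip = [120, 15, 8, 40]
control = [20, 15, 30, 0]

adjusted = adjust_counts(chip, control)
print(adjusted)  # prints [100, 1, 1, 40]
```

The floor of 1 keeps peaks where the control has as many or more reads than the ChIP from ending up with zero or negative counts.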
modified 24 days ago • written 24 days ago by Rory Stark
8 weeks ago by
Asaf wrote:

The assumption behind DESeq2 normalization is that most of the entities (peaks in your case) do not change across samples. If you think this assumption holds, you can trust DESeq2 normalization. If you have a set of peaks that you expect to be more stable, you can give this list to DESeq2 and normalize using only those peaks (the controlGenes argument of estimateSizeFactors).


Thanks a lot for responding so quickly! Could you please elaborate on the library size normalisation question that I had? Say, in DESeq2 the library size is the colSums of the raw counts, but DiffBind probably uses the total mapped reads from the BAM file. These are fundamentally different library sizes. How might the results be affected?

written 8 weeks ago by bioinfouser

DESeq2 does not normalize by total library size. Roughly, for each sample it takes the ratio of each gene's (or peak's) count to that gene's geometric mean across all samples, and uses the median of those ratios as the sample's normalization factor.
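
A rough sketch of this median-of-ratios idea in plain Python follows. It is an illustration under stated assumptions, not DESeq2's actual code: the function name `median_of_ratios`, the optional `control_rows` argument (a stand-in for restricting the estimate to a list of stable peaks), and the toy counts are all hypothetical.

```python
import math
from statistics import median


def median_of_ratios(counts, control_rows=None):
    """Per-sample size factors from a peaks-by-samples count matrix.

    counts: list of rows (peaks), each a list of per-sample counts.
    control_rows: optional row indices to restrict the estimate to.
    """
    rows = counts if control_rows is None else [counts[i] for i in control_rows]
    # A row containing a zero has a geometric mean of zero, so skip it.
    rows = [r for r in rows if all(c > 0 for c in r)]
    if not rows:
        raise ValueError("no rows with all-positive counts")
    n_samples = len(rows[0])
    # Log of the geometric mean of each row across samples.
    log_geo_means = [sum(math.log(c) for c in r) / n_samples for r in rows]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(r[j]) - g for r, g in zip(rows, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy matrix: three peaks differ twofold, one peak is strongly differential.
counts = [
    [100, 200],
    [50, 100],
    [30, 60],
    [10, 400],
]
size_factors = median_of_ratios(counts)
total_count_ratio = sum(r[1] for r in counts) / sum(r[0] for r in counts)

print(size_factors)
print(total_count_ratio)
```

Here the column sums differ by a factor of 4 between the two samples, largely because of the single strongly differential peak. The median-based size factors differ only by a factor of about 2, since the median follows the majority of peaks. That robustness relies on the assumption that most peaks do not change.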

written 8 weeks ago by Asaf

I see. I might be wrong then! But this is mentioned in the DESeq2 vignette:

The DESeq2 model internally corrects for library size, so transformed or normalized values such as counts scaled by library size should not be used as input.

modified 8 weeks ago • written 8 weeks ago by bioinfouser

It does correct for library size, but not by dividing by the total number of reads.

written 8 weeks ago by Asaf

See this thread: Can someone please explain in simple terms how DESeq2 works?

written 8 weeks ago by Asaf

I will go through the link. Thanks!

written 8 weeks ago by bioinfouser