Which scoring method is best to use in dba.count() in the DiffBind R package?
3.5 years ago

I am analyzing ChIP-seq data using DiffBind. dba.count() has many parameters, including a choice of scoring functions:

  1. DBA_SCORE_READS: raw read count for interval using only reads from ChIP
  2. DBA_SCORE_READS_FOLD: raw read count for interval from ChIP divided by read count for interval from control
  3. DBA_SCORE_READS_MINUS: raw read count for interval from ChIP minus read count for interval from control
  4. DBA_SCORE_RPKM: RPKM for interval using only reads from ChIP
  5. DBA_SCORE_RPKM_FOLD: RPKM for interval from ChIP divided by RPKM for interval from control
  6. DBA_SCORE_TMM_READS_FULL: TMM normalized (using edgeR), using ChIP read counts and Full Library size
  7. DBA_SCORE_TMM_READS_EFFECTIVE: TMM normalized (using edgeR), using ChIP read counts and Effective Library size
  8. DBA_SCORE_TMM_MINUS_FULL: TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Full Library size
  9. DBA_SCORE_TMM_MINUS_EFFECTIVE: TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Effective Library size
  10. DBA_SCORE_TMM_READS_FULL_CPM: same as DBA_SCORE_TMM_READS_FULL, but reported in counts-per-million
  11. DBA_SCORE_TMM_READS_EFFECTIVE_CPM: same as DBA_SCORE_TMM_READS_EFFECTIVE, but reported in counts-per-million
  12. DBA_SCORE_TMM_MINUS_FULL_CPM: same as DBA_SCORE_TMM_MINUS_FULL, but reported in counts-per-million
  13. DBA_SCORE_TMM_MINUS_EFFECTIVE_CPM: same as DBA_SCORE_TMM_MINUS_EFFECTIVE, but reported in counts-per-million
  14. DBA_SCORE_SUMMIT: summit height (maximum read pileup value)
  15. DBA_SCORE_SUMMIT_ADJ: summit height (maximum read pileup value), normalized to relative library size
  16. DBA_SCORE_SUMMIT_POS: summit position (location of maximum read pileup)

As a naive user, my question is: which method is best for generating counts from ChIP-seq data? Also, if you do not select anything, what is the default behaviour? Thank you in advance!

ChIP-Seq diffbind • 3.3k views

Difficult to tell without details on your experiment. As a beginner, it is typically recommended to read the manual thoroughly and leave everything at default until you have the experience and understanding to change options.

20 months ago
Rory Stark ★ 1.7k

Normalization in DiffBind has evolved since the original answer a couple of years ago, especially since version 3.0.

The default is now to compute the same normalization factors in all the cases you mention. Basically, dba.normalize() is invoked by default in dba.count(), and the resulting values are used for plotting, retrieving the count matrix, and running analyses.

The default normalization factors (based solely on library sizes) can be overridden by an explicit call to dba.normalize(), and the updated normalized counts will then be used for plotting and matrix retrieval.
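To make "based solely on library sizes" concrete, here is a toy base-R sketch (it does not call DiffBind). It assumes each sample's factor is its library size divided by the mean library size; that is an assumption about how library-size normalization is typically done, and the counts, library sizes, and variable names are made up for illustration. The dba.normalize() help page is the place to check the exact calculation.

```r
# Toy count matrix: 3 peaks x 2 samples (made-up numbers).
counts <- matrix(c(10, 20, 30,
                   30, 60, 90),
                 nrow = 3,
                 dimnames = list(paste0("peak", 1:3), c("sampleA", "sampleB")))

# Hypothetical total sequencing depth of each sample.
lib_sizes <- c(sampleA = 1e6, sampleB = 3e6)

# Assumed factor: library size relative to the mean library size.
norm_factors <- lib_sizes / mean(lib_sizes)

# Divide each column (sample) by that sample's factor.
norm_counts <- sweep(counts, 2, norm_factors, "/")
print(norm_counts)
```

In this toy case sampleB was sequenced three times as deeply as sampleA, and after scaling the two columns hold the same values for each peak.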

The score used for plotting and count matrix retrieval can be changed from the normalized counts (score=DBA_SCORE_NORMALIZED) using dba.count() with peaks=NULL and setting the score parameter. The available score values, as outlined in the original question, have also been updated. dba.analyze() will always use the computed normalization factors regardless of what the score parameter is set to.

So, from version 3.0 onwards, you should be able to simply run dba.count() and have consistent, normalized counts used for all plotting, retrieval, and analysis functions. If you change the normalization parameters using dba.normalize(), this will be reflected everywhere as well.
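As a rough sketch of that workflow, assuming DiffBind 3.0 or later is installed: the code below uses the example counts object bundled with the package (tamoxifen_counts, which loads an object named tamoxifen) so that no BAM files are needed. The choice of TMM and RPKM here is only for illustration, not a recommendation, and norm_mat is a made-up variable name.

```r
library(DiffBind)

# Example DBA object shipped with DiffBind; reads are already counted.
data(tamoxifen_counts)

# Optionally override the default library-size normalization,
# here with TMM factors (illustrative choice only).
tamoxifen <- dba.normalize(tamoxifen, normalize = DBA_NORM_TMM)

# Retrieve the consensus-peak x sample matrix of scores as a data frame.
norm_mat <- dba.peakset(tamoxifen, bRetrieve = TRUE,
                        DataType = DBA_DATA_FRAME)
head(norm_mat)

# Change the score used for plotting/retrieval without recounting reads.
tamoxifen <- dba.count(tamoxifen, peaks = NULL, score = DBA_SCORE_RPKM)
```

With your own data, the starting point would instead be a DBA object built from a sample sheet and passed through dba.count().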

Thanks, what a nice clear answer!

Thank you very much, Rory!

3.5 years ago
Rory Stark ★ 1.7k

One important thing to note is that the score computed by dba.count() is only used for plotting the entire binding matrix. The values used for the differential analysis (using dba.analyze()) are determined at analysis time based on the values of certain parameters (method, bSubControl, and bFullLibrarySize).

The default score is DBA_SCORE_TMM_MINUS_FULL (as described in the help page for dba.count()), but this is only used for global plots. This score represents TMM normalized read counts after the control reads have been subtracted. Using something like DBA_SCORE_RPKM gives what are probably the least "biased" scores for use in these plots.
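For readers unsure what these scores mean numerically, here is a toy base-R sketch of the simpler ones from the question (reads minus control, reads fold, and RPKM). It follows the textbook definitions with made-up numbers and variable names; DiffBind's own implementation may treat edge cases differently (for example, zero control counts or scaling of control reads), so this is not a reproduction of its code.

```r
# Toy data for one ChIP sample with a matching control (made-up numbers).
chip     <- c(120, 45, 300)    # ChIP reads per interval
control  <- c(20, 15, 60)      # control reads per interval
width    <- c(500, 250, 1000)  # interval widths in bp
lib_chip <- 2e6                # total reads in the ChIP library

# ChIP reads minus control reads (cf. DBA_SCORE_READS_MINUS).
reads_minus <- chip - control

# ChIP reads divided by control reads (cf. DBA_SCORE_READS_FOLD).
reads_fold <- chip / control

# Reads per kilobase of interval per million library reads (cf. DBA_SCORE_RPKM).
rpkm <- chip / (width / 1000) / (lib_chip / 1e6)

print(data.frame(reads_minus, reads_fold, rpkm))
```

RPKM corrects for both interval width and sequencing depth, and in this sketch it uses only the ChIP reads.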

Hi Rory:

So it seems like:

  1. dba.count() will do normalisation (for plotting purposes only).
  2. dba.normalize() will also do normalisation (with a lot of options).
  3. dba.analyze() will also do normalisation (for differential analysis purposes).

So what is the difference between the normalisations done by these three functions? Or maybe I have misunderstood some of them.

What I really want is to retrieve a matrix, similar to an RNA-seq gene expression matrix or a DNA methylation array matrix, in which the values are comparable between samples. More specifically, assume I have ChIP-seq data for 5 normal samples and 5 cancer samples. I want an nrow(consensus) x 10 matrix for work like:

  • Drawing boxplots for differentially enriched genes, e.g. comparing binding values of TSS-enriched peaks.
  • Correlating a given peak's binding value with sample age, etc.
  • Integrating with other omics data.
  • ...

In this situation, should I use dba.peakset(myDBA, bRetrieve=TRUE) after dba.count(), or dba.report() after dba.analyze()?

Sorry for asking so many questions. I am totally confused by normalisation in DiffBind; it seems to happen in every function.

Best, Tian

