Question

Which scoring method is best to use in "dba.count" in the DiffBind R package?


4.6 years ago

m.sadman.sakib ▴ 120

I am analyzing ChIP-seq data using DiffBind. dba.count() has many parameters, including the following scoring functions:

DBA_SCORE_READS raw read count for interval using only reads from ChIP

DBA_SCORE_READS_FOLD raw read count for interval from ChIP divided by read count for interval from control

DBA_SCORE_READS_MINUS raw read count for interval from ChIP minus read count for interval from control

DBA_SCORE_RPKM RPKM for interval using only reads from ChIP

DBA_SCORE_RPKM_FOLD RPKM for interval from ChIP divided by RPKM for interval from control

DBA_SCORE_TMM_READS_FULL TMM normalized (using edgeR), using ChIP read counts and Full Library size

DBA_SCORE_TMM_READS_EFFECTIVE TMM normalized (using edgeR), using ChIP read counts and Effective Library size

DBA_SCORE_TMM_MINUS_FULL TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Full Library size

DBA_SCORE_TMM_MINUS_EFFECTIVE TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Effective Library size

DBA_SCORE_TMM_READS_FULL_CPM same as DBA_SCORE_TMM_READS_FULL, but reported in counts-per-million.

DBA_SCORE_TMM_READS_EFFECTIVE_CPM same as DBA_SCORE_TMM_READS_EFFECTIVE, but reported in counts-per-million.

DBA_SCORE_TMM_MINUS_FULL_CPM same as DBA_SCORE_TMM_MINUS_FULL, but reported in counts-per-million.

DBA_SCORE_TMM_MINUS_EFFECTIVE_CPM same as DBA_SCORE_TMM_MINUS_EFFECTIVE, but reported in counts-per-million.

DBA_SCORE_SUMMIT summit height (maximum read pileup value)

DBA_SCORE_SUMMIT_ADJ summit height (maximum read pileup value), normalized to relative library size

DBA_SCORE_SUMMIT_POS summit position (location of maximum read pileup)

As a naive user, my question is, which method is the best to generate counts from ChIP seq data? Also, if you do not select anything, which is the default behaviour? Thank you in advance!

ChIP-Seq diffbind • 5.1k views



Difficult to tell without details on your experiment. For a beginner, the usual recommendation is to read the manual thoroughly and leave everything at the defaults until you have the experience and understanding to change options.

4.5 years ago by ATpoint 81k


4.5 years ago

Rory Stark ★ 2.0k

One important thing to note is that the score computed by dba.count() is only used for plotting the entire binding matrix. The values used for the differential analysis (using dba.analyze()) are determined at analysis time based on the values of certain parameters (method, bSubControl, and bFullLibrarySize).

The default score is DBA_SCORE_TMM_MINUS_FULL (as described in the help page for dba.count()), but this is only used for global plots. This score represents TMM normalized read counts after the control reads have been subtracted. Using something like DBA_SCORE_RPKM gives what are probably the least "biased" scores for use in these plots.



Hi Rory:

So it seems like:

dba.count() will do normalization (for plotting purposes only).
dba.normalize() will also do normalization (with a lot of options).
dba.analyze() will also do normalization (for differential analysis purposes).

So what is the difference between the normalizations in these three functions? Or maybe I have misunderstood some of them.

What I really want is to retrieve a "matrix", similar to an RNA-seq gene expression matrix or a DNA methylation array matrix, in which the values are comparable between samples. More specifically, assuming I have ChIP-seq data for 5 normal samples and 5 cancer samples, I want an nrow(consensus) x 10 matrix for work such as:

Draw boxplots for differentially enriched genes, so that I can compare the binding values of TSS-enriched peaks.
Correlate a given peak's binding value with sample age, etc.
Integrate with other omics data.
...

In this situation, should I use dba.peakset(myDBA, bRetrieve=TRUE) after dba.count(), or dba.report() after dba.analyze()?

Sorry for asking so many questions. I am totally confused by normalization in DiffBind; it seems to be everywhere, in every function.

Best, Tian

2.8 years ago by Tian ▴ 50


Hi Rory, thanks for your answer.

I use DiffBind for ATAC-seq and wish to generate PCA plots for all samples. May I ask which "score" is appropriate: DBA_SCORE_RPKM or the default DBA_SCORE_TMM_MINUS_FULL?

Thanks!

5 months ago by Wang Cong ▴ 10


In the current versions of DiffBind, using score=DBA_SCORE_NORMALIZED (the default) is usually the right answer here as it will plot/report using whatever normalization is being used in the analysis. By default, this is a "light" normalization, with the raw counts adjusted relative to the library size (total number of aligned reads in the associated bam file).
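As a rough illustration of what a library-size-only normalization does, the base R sketch below scales a toy count matrix by each sample's library size relative to the mean library size. The counts, library sizes, and variable names are invented for the example, and the formula is an assumption based on the documented behaviour of library-size normalization, not a copy of DiffBind's internal code.

```r
# Toy example: 3 peaks x 4 samples of raw read counts (invented numbers).
counts <- matrix(c(100, 200, 400, 800,
                    50, 100, 200, 400,
                    10,  20,  40,  80),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("peak", 1:3), paste0("sample", 1:4)))

# Total aligned reads per sample (invented; DiffBind derives these from the BAM files).
lib_sizes <- c(sample1 = 1e7, sample2 = 2e7, sample3 = 4e7, sample4 = 8e7)

# Assumed normalization factor: library size relative to the mean library size.
norm_factors <- lib_sizes / mean(lib_sizes)

# Divide each sample's column by its factor.
norm_counts <- sweep(counts, 2, norm_factors, "/")

print(round(norm_factors, 3))
print(norm_counts)
```

Because the toy counts were chosen to be proportional to library size, every peak ends up with the same normalized value in all four samples, which is the point of this kind of adjustment: differences that only reflect sequencing depth disappear.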

5 months ago by Rory Stark ★ 2.0k

score 6 · Accepted Answer · 2021-07-20


2.8 years ago

Rory Stark ★ 2.0k

Normalization in DiffBind has evolved since the original answer a couple of years ago, especially since version 3.0.

The default is now to compute the same normalization factors in all the cases you mention. Basically, dba.normalize() is invoked by default in dba.count(), and these values are used for plotting, retrieving the count matrix, and running analyses.

The default normalization factors (based solely on library sizes) can be overridden by an explicit call to dba.normalize(), and the updated normalized counts will be used for plotting and matrix retrieval.

The score used for plotting and count matrix retrieval can be changed from the normalized counts (score=DBA_SCORE_NORMALIZED) using dba.count() with peaks=NULL and setting the score parameter. The available score values, as outlined in the original question, have also been updated. dba.analyze() will always use the computed normalization factors regardless of what the score parameter is set to.

So, from version 3.0 onwards, you should be able to simply run dba.count() and have consistent, normalized counts used for all plotting, retrieval, and analysis functions. If you change the normalization parameters using dba.normalize(), this will be reflected everywhere as well.
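A minimal sketch of that workflow, assuming DiffBind 3.0 or later is installed. To avoid needing BAM files it uses the tamoxifen_counts example object that ships with the package (the object it loads is called tamoxifen); with your own data you would first build the object with dba() and dba.count(). The variable names norm_mat, tamoxifen_rpkm, and rpkm_mat are made up for the example.

```r
library(DiffBind)

# Example DBA object with reads already counted (bundled with DiffBind).
data(tamoxifen_counts)

# Retrieve the consensus peak x sample matrix using the score currently
# set on the object (normalized counts are the default in DiffBind >= 3.0,
# per the answer above).
norm_mat <- dba.peakset(tamoxifen, bRetrieve = TRUE, DataType = DBA_DATA_FRAME)
head(norm_mat)

# Switch the score used for plotting and retrieval without recounting reads.
tamoxifen_rpkm <- dba.count(tamoxifen, peaks = NULL, score = DBA_SCORE_RPKM)
rpkm_mat <- dba.peakset(tamoxifen_rpkm, bRetrieve = TRUE, DataType = DBA_DATA_FRAME)

# PCA of all samples, based on the score currently set on the object.
dba.plotPCA(tamoxifen, attributes = DBA_TISSUE, label = DBA_CONDITION)
```

The retrieved data frame has the peak coordinates in its first columns followed by one column per sample, so it can be used like an expression matrix for boxplots, correlations, or integration with other data.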



Thanks, what a nice clear answer!

2.8 years ago by Tian ▴ 50


Thank you very much, Rory!

2.7 years ago by m.sadman.sakib ▴ 120


Thank you for all the helpful advice and explanations. Can you further explain this part:

"The default normalization factors (based solely on library sizes) can be over-ridden by an explicit call to dba.normalize(), and the updated normalized counts will be used for plotting and matrix retrieval."

That is, how do I explicitly call dba.normalize() to override the normalization technique in dba.count()?

When I do dba.normalize() first, I get an error that there are "No samples present with read counts".

res_norm <- dba.normalize(DBA, normalize = DBA_NORM_LIB, library = DBA_LIBSIZE_FULL,
                              background = FALSE, spikein = FALSE)

res_norm <- dba.count(res_norm, peaks = peaks, summits = summits, 
                          bParallel = FALSE)

10 months ago by Jo • 0


If you are calling dba.normalize() directly, it must be called after calling dba.count(). dba.count() will apply the default normalization, but the subsequent call to dba.normalize() will apply the new normalization parameters you specify.

Note that in your example code, the parameters you specified to dba.normalize() are the same as those used for the default normalization.
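To make the ordering concrete, here is a hedged sketch in which counting comes first and dba.normalize() is then called with parameters that differ from the defaults. It uses the bundled tamoxifen_counts object as a stand-in for the result of dba.count() (an assumption made so that the example runs without BAM files), and the sample sheet name in the comment is hypothetical. The choice of DBA_NORM_RLE with DBA_LIBSIZE_PEAKREADS is only an illustration, not a recommendation.

```r
library(DiffBind)

# Stand-in for the counting step, which with real data would look like:
#   counted <- dba.count(dba(sampleSheet = "samples.csv"))  # hypothetical file
data(tamoxifen_counts)
counted <- tamoxifen

# Override the normalization AFTER counting, here with RLE
# computed on reads in peaks instead of full library sizes.
renorm <- dba.normalize(counted,
                        normalize = DBA_NORM_RLE,
                        library   = DBA_LIBSIZE_PEAKREADS)

# Retrieve the stored normalization settings to confirm the change.
new_norm <- dba.normalize(renorm, bRetrieve = TRUE)
new_norm$norm.method
new_norm$norm.factors
```

Calling dba.normalize() with bRetrieve = TRUE returns the stored settings instead of a DBA object, which is a convenient way to check that the override took effect before plotting or running dba.analyze().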

10 months ago by Rory Stark ★ 2.0k