help for dba.count settings for RNAPII-ChIP using GRanges peakset
5 months ago
bertb ▴ 10

Hello,

I am trying to analyze Pol2 ChIP-seq data, and the standard MACS2/peak-counting pipeline doesn't seem to capture what I'm after, since there's binding all over the promoter and coding region. That is, I'm less interested in finding the highest point and tweaking the summits factor, and more interested in the overall concentration differences (and profile changes) across these gene bodies/promoters... kind of like an RNA-seq analysis where I keep duplicates but worry more about normalization.

What I'd like to do is feed dba.count with a GRanges (transcripts) peakset, but I want to make sure I get the settings right. Would something like this work?:

DBA <- dba(sampleSheet="sample.sheet.csv")
# sample sheet is standard, with bamReads, bamControl (and MACS peak files)
GR <- transcripts(txdb)

counts <- dba.count(DBA, peaks=GR)


Should summits be set to FALSE? Maybe I'm missing something more fundamental.

DiffBind ChIPseq
5 months ago
Rory Stark ★ 1.6k

Supplying the annotated regions and setting 'summits=FALSE' when calling dba.count() will, in principle, work. A few things worth noting:

• The regions returned by transcripts() include many overlapping intervals, which will be merged, so the binding matrix will be constructed with fewer rows than there are transcripts.
• The default behavior of filtering regions with low read counts in all samples will likely further reduce the number of regions in the final binding matrix. This is desirable as some of the annotated regions may not be enriched in your ChIP.
• You can retrieve the regions actually interrogated by calling dba.peakset() with bRetrieve=TRUE after the call to dba.count():

merged_filtered_regions <- dba.peakset(counts, bRetrieve=TRUE)

• Specifically, you may want to confirm that the merging/filtering process doesn't alter the distribution of region widths too much. You can do this by comparing summary(width(GR)) with summary(width(merged_filtered_regions))

• This analysis can also be run with the summits parameter set. This will take an enriched "sample" of each transcript region for comparison and identify regions where this sample is differentially enriched. This approach is less likely to be thrown off by including very large regions that may contain a larger fraction of "background" reads.
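Putting the points above together, a minimal sketch of the workflow (assuming a TxDb object named txdb and a sample sheet file; the object names and the summits value of 250 are illustrative, not prescriptive):

```r
library(DiffBind)
library(GenomicFeatures)

# Load samples and count reads over annotated transcripts,
# using the full regions rather than re-centering around summits
DBA    <- dba(sampleSheet="sample.sheet.csv")
GR     <- transcripts(txdb)
counts <- dba.count(DBA, peaks=GR, summits=FALSE)

# Retrieve the regions actually interrogated (after merging and filtering)
merged_filtered_regions <- dba.peakset(counts, bRetrieve=TRUE)

# Check that merging/filtering hasn't distorted the width distribution
summary(width(GR))
summary(width(merged_filtered_regions))

# Alternative: take an enriched "sample" within each transcript,
# e.g. a 500bp window around the point of greatest enrichment
counts_summits <- dba.count(DBA, peaks=GR, summits=250)
```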


When it comes to normalization, I've reviewed the user guide, as well as your Bioconductor workshops from 2020, and I'm still having a bit of difficulty understanding the difference between the default full-library-size normalization ("lib") and background normalization (normalize=DBA_NORM_NATIVE, background=TRUE). Technically, I understand the difference: background normalization counts reads in large genomic bins and applies a native normalization method to those bins, rather than just scaling for sequencing depth.

But strategically, both seem to aim to "level" the background between all samples, based on the assumption that backgrounds should be largely similar across samples. I guess my question is, why might one choose background normalization instead of the default ("lib")? Running both normalizations, I do seem to be getting differences in the number of significantly bound sites in my analyses.
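For concreteness, the two calls I'm comparing are roughly the following (using a counts object returned by dba.count):

```r
# Default: scale each sample by its full library size
norm_lib <- dba.normalize(counts, normalize=DBA_NORM_LIB)

# Background normalization: apply a native method to large genomic bins
norm_bg  <- dba.normalize(counts, normalize=DBA_NORM_NATIVE, background=TRUE)
```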

Is there a way to view the binding matrix after each normalization method to see how these normalization methods are affecting specific regions?

Thank you!