Question

Differential ChIP-seq with csaw: How to normalise counts on repetitive regions (telomeres)?



6.6 years ago

gil.hornung ▴ 100

Hi,

I am interested in H3K9me2 signal (in S. pombe), which is abundant on telomeres and centromeres. These regions are notorious for being highly repetitive. I clearly see an effect between treated and control samples in the coverage on telomeres, and I would like to quantify these differences using csaw. However, I think that the default normalisation (TMM) is problematic: if I have more signal from the telomeres, then I also have more multi-mapping reads (because they fall on repeats). The multi-mappers are not counted, which affects the overall count normalisation.

Any thoughts on how to solve this? Maybe skip TMM and divide the counts by the total number of mapped reads (not just the uniquely mapped)?

Thanks,

Gil Hornung

ChIP-Seq csaw • 2.0k views


score 1 · Answer 1 · 2017-09-21

This is the answer I got from Aaron Lun on Bioconductor:

The multi-mapping reads (or lack thereof) should not affect how TMM normalization behaves. The assumption of TMM normalization on binned counts is that most regions of the genome are not marked, i.e., they are background, and that background regions are not differentially bound (DB) between conditions. The normalization factors are then computed to remove any systematic differences in the background counts between samples. Such differences are estimated empirically from the counts, so the normalization will automatically account for the fact that multi-mapping reads are not counted.
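To make the argument above concrete, here is a toy sketch in Python (not csaw or edgeR code). It uses a simplified trimmed mean of per-bin log-ratios as a stand-in for TMM, without the library-size scaling, abundance trimming, or precision weights of the real method. All the numbers are made up for illustration: the bin counts, the 0.7 background ratio, and the 3-fold gain in marked bins are assumptions, not values from the thread.

```python
import math
import random


def trimmed_mean_log_ratio(sample, reference, trim=0.3):
    """Simplified TMM-style factor: trimmed mean of per-bin log2 ratios.

    This is an illustration only, not the edgeR/csaw implementation.
    """
    log_ratios = sorted(
        math.log2(s / r) for s, r in zip(sample, reference) if s > 0 and r > 0
    )
    k = int(len(log_ratios) * trim)  # number of bins dropped from each tail
    kept = log_ratios[k:len(log_ratios) - k]
    return 2 ** (sum(kept) / len(kept))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_background, n_marked = 950, 50  # hypothetical: most bins are background

# Control sample: low counts in background bins, high counts in marked bins.
control = [max(1, round(rng.gauss(20, 3))) for _ in range(n_background)]
control += [max(1, round(rng.gauss(200, 20))) for _ in range(n_marked)]

# Treated sample: background bins are systematically lower (hypothetical
# ratio of 0.7, e.g. because more reads fell on repeats and were discarded),
# while the marked bins gain signal (hypothetical 3-fold increase).
true_background_ratio = 0.7
treated = [
    max(1, round(c * true_background_ratio * rng.gauss(1, 0.05)))
    for c in control[:n_background]
]
treated += [round(c * 3) for c in control[n_background:]]

# The factor is estimated from the binned counts alone, with no knowledge
# of how many reads were discarded as multi-mappers.
factor = trimmed_mean_log_ratio(treated, control)
print(f"estimated factor: {factor:.3f}, true background ratio: {true_background_ratio}")
```

Because the marked bins are a minority, they fall in the trimmed tails, and the estimated factor tracks the background ratio regardless of why the background counts differ between samples.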

Of course, this assumes that you're using the same set of reads for normalization as you are for the rest of the DB analysis. You shouldn't be computing normalization factors with the uniquely-mapped reads and then performing the rest of the analysis with multi-mapping reads (i.e., use the same readParam object for both). I'm also assuming you're using bins over the entire genome; don't just use the telomere regions for normalization.

There are also probably other issues you should consider. For example, if the telomere length changes between control and treatment, changes in marking would be confounded with changes in coverage due to copy number. This would be pretty hard to resolve from the ChIP-seq data alone; you'd probably need some other technique. There may also be other biases, e.g., differences in IP efficiency between control and treatment conditions (this can be checked by ensuring that other marked non-telomere loci are not DB).