Normalizing NGS samples expected to have different total reads
4.3 years ago
mbk0asis ▴ 680

Hello!

I have MBD-seq datasets of Dnmt1 Knock-Out and control cells.

As expected, the Dnmt1 KO sample covered far fewer genomic regions than the control, since no Dnmt1 is present in the KO. The problem is that the KO reads are concentrated in those few regions, which makes the per-region signal intensity very high.

What I'm curious about here is how I should normalize data like this, where the samples are expected to differ in their overall reads.

For example, if the KO and control samples have 100 and 10,000 detected bins (> 0 reads) respectively, and each of them has a million total reads, every bin in the KO will have roughly 100 times more reads, leading to a biased quantification of MBD-seq enrichment.
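To make the arithmetic concrete, here is a toy calculation with those numbers (assuming, for simplicity, that reads spread evenly over the detected bins):

```python
# Toy numbers from the example above: both libraries have 1e6 total reads;
# the KO reads fall into 100 bins, the control reads into 10,000 bins.
total_reads = 1_000_000

ko_reads_per_bin = total_reads / 100         # 10,000 reads per detected bin
ctrl_reads_per_bin = total_reads / 10_000    # 100 reads per detected bin

ko_cpm = ko_reads_per_bin / total_reads * 1e6      # 10,000 CPM per bin
ctrl_cpm = ctrl_reads_per_bin / total_reads * 1e6  # 100 CPM per bin

# KO bins look 100x more "enriched" purely because the same library size
# is spread over far fewer detected bins.
print(ko_cpm / ctrl_cpm)  # 100.0
```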

Would it be OK to subsample 1/100 of the reads from the KO sample to compensate for the difference?

What does everyone think?

Thank you!

Tags: NGS • Normalization
4.3 years ago
ATpoint 82k

I suggest you go through the manual of MEDIPS https://bioconductor.org/packages/release/bioc/html/MEDIPS.html to get a guideline for the analysis. It also covers normalization.


Thank you for the advice! I will check that out.


I found the sentence "It has been proposed that quantile normalization can correct for varying DNA enrichment efficiencies." in the MEDIPS vignette. However, this seems different from my case.

I think quantile normalization would be useful when sequencing depth differs across the genome between samples, but in my case I expect the KO to differ from the control (much less sequencing signal from the KO) simply because far fewer genomic regions would have been methylated.

To compensate for the bias arising from the difference in genomic coverage, while still preserving the real coverage difference between samples, I improvised the following normalization method:

(1) I calculated the read enrichment in CPM (counts per million reads) in 500 bp bins across the genome.

(2) I counted the number of bins containing at least one read (the detected bins).

(3) I divided the CPM values in each sample by the number of detected bins from (2), giving what I call CPBM (counts per bin per million reads). A rough sketch of the calculation is below.
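Here is a minimal sketch of the calculation in Python/pandas, in case that makes it clearer; the toy counts table and column names are made up for illustration and are not taken from any package:

```python
import pandas as pd

# Hypothetical input: one row per 500 bp bin, one column of raw read counts per sample.
counts = pd.DataFrame(
    {"KO": [250, 0, 1200, 0, 80], "control": [30, 25, 40, 10, 35]},
    index=["bin1", "bin2", "bin3", "bin4", "bin5"],
)

# (1) CPM: counts per million mapped reads in each sample.
cpm = counts / counts.sum() * 1e6

# (2) Number of detected bins (bins with at least one read) per sample.
detected_bins = (counts > 0).sum()

# (3) CPBM: divide each sample's CPM by its number of detected bins.
cpbm = cpm / detected_bins

print(cpbm)
```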

What does everyone think? Is this normalization acceptable?

Any comments will be appreciated!

Thank you!


This sort of scenario is really a worst case for MeDIP. The best you can do is find those regions with some notable amount of coverage in the KO samples and normalize using only them. This presumes, of course, that those regions weren't affected by the KO, which may not be true, but it's the best that you can do.

In general, I expect a strong global change in your data, since that's what we've seen in DNMT KO samples using WGBS.
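A rough sketch of that idea, assuming you have a raw counts-per-bin table; the coverage threshold and the median-of-ratios-style scaling are illustrative choices for the sketch, not a prescribed method:

```python
import numpy as np
import pandas as pd

# Hypothetical raw counts per 500 bp bin (rows) for each sample (columns).
counts = pd.DataFrame(
    {"KO": [180, 0, 950, 2, 60], "control": [150, 40, 800, 35, 70]},
    index=["bin1", "bin2", "bin3", "bin4", "bin5"],
)

# Keep only bins with a notable amount of coverage in the KO sample
# (the threshold of 10 reads is arbitrary and would need tuning).
anchor_bins = counts[counts["KO"] >= 10]

# Median-of-ratios style size factors computed on those anchor bins only,
# similar in spirit to DESeq-style normalization restricted to a bin subset.
log_counts = np.log(anchor_bins)
log_geo_mean = log_counts.mean(axis=1)
size_factors = np.exp(log_counts.sub(log_geo_mean, axis=0).median(axis=0))

# Normalized counts: divide each sample by its size factor.
normalized = counts / size_factors
print(size_factors)
print(normalized)
```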

