Question: Normalizing NGS samples expected to have different total reads
mbk0asis570 wrote (10 months ago, Korea, Republic Of):


I have MBD-seq datasets of Dnmt1 Knock-Out and control cells.

As expected, the Dnmt1 KO covered very few genomic regions compared to the control sample, since no Dnmt1 is present in the KO. The problem is that the reads in the KO sample were concentrated in those few genomic regions, which made the per-region intensity too high.

What I'm wondering is how I should normalize data like this, where the samples are expected to differ in overall read distribution?

For example, if KO and control have 100 and 10,000 detected bins (> 0 reads) respectively, and each sample has a million total reads, each bin in the KO will have roughly 100 times more reads, leading to a biased quantification of MBD-seq enrichment.
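The arithmetic above can be illustrated with a tiny sketch using the hypothetical numbers from the example (1 million reads spread evenly over 100 vs. 10,000 covered bins):

```python
# Toy illustration of the depth-concentration bias described above.
# All numbers are the hypothetical ones from the example, not real data.
total_reads = 1_000_000
ko_bins, ctrl_bins = 100, 10_000  # detected bins (> 0 reads)

# Naive CPM per covered bin, assuming reads spread evenly over covered bins.
per_million = total_reads / 1e6
ko_cpm_per_bin = (total_reads / ko_bins) / per_million      # 10,000
ctrl_cpm_per_bin = (total_reads / ctrl_bins) / per_million  # 100

print(ko_cpm_per_bin / ctrl_cpm_per_bin)  # 100x apparent enrichment
```

So with plain CPM normalization, the KO bins look 100-fold more enriched purely because the same sequencing depth is packed into fewer bins.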

Would it be OK to subset 1/100 of the reads from the KO sample to compensate for the difference?

What does everyone think?

Thank you!

ngs normalization
modified 10 months ago • written 10 months ago by mbk0asis570
ATpoint42k wrote (10 months ago):

I suggest you go through the manual of MEDIPS to get a guideline for the analysis. It also covers normalization.

written 10 months ago by ATpoint42k

Thank you for the advice! I will check that out.

written 10 months ago by mbk0asis570

I found a sentence in the MEDIPS vignette: "It has been proposed that quantile normalization can correct for varying DNA enrichment efficiencies." However, this seems different from my case.

I think quantile normalization would be useful when the difference in sequencing depth over the genome between samples is technical, but in my case I expected the KO to differ from the control (much less sequencing signal in the KO) because far fewer genomic regions would have been methylated.

To compensate for the bias arising from the difference in genomic coverage, while preserving the biological signal of that difference, I improvised a normalization method:

(1) I calculated the read enrichment in CPM (counts per million reads) in 500 bp bins over the genome.

(2) I counted the number of bins containing at least one read (> 0 reads).

(3) I divided the CPM values in each sample by the bin count from (2) - denoted CPBM (counts per bin per million reads).
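For concreteness, here is a minimal sketch of steps (1)-(3) as I understand them, using made-up toy counts (the function name `cpbm` and the example arrays are mine, not from any package):

```python
import numpy as np

def cpbm(counts):
    """Counts-per-bin-per-million: CPM divided by the number of
    covered bins (> 0 reads), following steps (1)-(3) above.
    `counts` is a 1D array of raw read counts in 500 bp bins."""
    counts = np.asarray(counts, dtype=float)
    cpm = counts / counts.sum() * 1e6          # step (1): CPM per bin
    n_covered = int((counts > 0).sum())        # step (2): covered bins
    return cpm / n_covered                     # step (3): CPBM

# Toy example: KO reads concentrated in 2 bins, control spread over 4.
ko   = [500_000, 500_000, 0, 0]
ctrl = [250_000, 250_000, 250_000, 250_000]
print(cpbm(ko))    # [250000. 250000.      0.      0.]
print(cpbm(ctrl))  # [62500. 62500. 62500. 62500.]
```

Note that dividing by the covered-bin count shrinks the KO values toward the control's, but whether that correction is biologically justified is exactly the open question here.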

What does everybody think? Is this normalization acceptable?

Any comments will be appreciated!

Thank you!

modified 10 months ago • written 10 months ago by mbk0asis570

This sort of scenario is really worst-case for MeDIP. The best you can do is find those regions with some notable amount of coverage in the KO samples and normalize only using them. This presumes, of course, that those regions weren't affected by the KO, which may not be true, but it's the best that you can do.
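One way to sketch that idea (my own illustrative code, not an established recipe; the threshold `min_ko` is an arbitrary placeholder for "notable coverage"):

```python
import numpy as np

def scale_factor(ko_counts, ctrl_counts, min_ko=10):
    """Compute a KO->control scaling factor using only bins with
    notable coverage in the KO sample, assuming those bins are
    unaffected by the knock-out (which may not hold in practice)."""
    ko = np.asarray(ko_counts, dtype=float)
    ctrl = np.asarray(ctrl_counts, dtype=float)
    keep = ko >= min_ko                      # bins retained for normalization
    # Median of per-bin ratios over the retained bins, in the spirit
    # of median-of-ratios normalization.
    return float(np.median(ctrl[keep] / ko[keep]))

# Toy counts: only the first two bins have notable KO coverage.
ko   = [20, 40, 0, 5]
ctrl = [40, 80, 100, 5]
print(scale_factor(ko, ctrl))  # 2.0
```

Multiplying the KO counts by this factor puts the retained (presumed-stable) bins on the same scale as the control, so the remaining differences reflect the KO effect rather than library composition.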

In general, I expect a strong global change in your data, since that's what we've seen in DNMT KO samples using WGBS.

written 10 months ago by Devon Ryan97k