What kind of systematic bias in sequencing data does the following normalization procedure address?
7.3 years ago
Dataman ▴ 370

Let's assume that you have extracted read depths from BAM files (containing aligned reads from a DNA-seq experiment) for both samples of a normal/tumor pair, and that you have calculated the read-depth log-ratios simply by:

log-ratio =  log_2( tumor_rd_i / normal_rd_i)


where tumor_rd_i and normal_rd_i are the tumor and normal read depths in the i-th window.

The normalization procedure goes like this:

For each chromosome, you make a histogram of all the log-ratio values, find the bin with the highest frequency, and take the midpoint of this bin (let's call this value the 'mode'). Then you take the median of all the modes (one per chromosome). Finally, you subtract this value (the median of the modes) from each read-depth log-ratio value.
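The steps above can be sketched as follows. This is only my reading of the procedure, not the code in question: the function names, the dict-of-arrays input, the default of 100 bins, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def chromosome_mode(values, n_bins=100):
    """Midpoint of the most populated histogram bin of one chromosome's log-ratios."""
    values = np.asarray(values, dtype=float)
    values = values[np.isfinite(values)]  # drop windows where a depth was zero
    counts, edges = np.histogram(values, bins=n_bins)
    top = int(np.argmax(counts))
    return 0.5 * (edges[top] + edges[top + 1])


def normalize_log_ratios(log_ratios_by_chrom, n_bins=100):
    """Subtract the median of the per-chromosome modes from every log-ratio."""
    modes = [chromosome_mode(v, n_bins) for v in log_ratios_by_chrom.values()]
    offset = float(np.median(modes))
    normalized = {
        chrom: np.asarray(v, dtype=float) - offset
        for chrom, v in log_ratios_by_chrom.items()
    }
    return normalized, offset


# Toy data (hypothetical): every chromosome is shifted by 1.0, and chr3 is
# shifted further to mimic a whole-chromosome gain.
rng = np.random.default_rng(0)
toy = {
    "chr1": rng.normal(1.0, 0.05, 5000),
    "chr2": rng.normal(1.0, 0.05, 5000),
    "chr3": rng.normal(1.585, 0.05, 5000),
}
normalized, offset = normalize_log_ratios(toy)
print(round(offset, 2))
```

After the subtraction, chr1 and chr2 should sit near 0 while chr3 keeps its shift relative to them.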

I can imagine that this procedure shifts the individual values towards zero, something like zero-mean normalization. What I don't understand is why we need this normalization for sequencing data in the first place, for instance before doing segmentation. Any explanation is much appreciated.

next-gen normalization bias • 1.9k views

Could you please provide a reference? You say we "need to do such normalization"; where did you find this information?


I am reading some Python code, and what I have written here is a summary of what I understood from it.

7.3 years ago
matted 7.7k

Imagine the total sequencing depths were different for tumor and normal. This would give you a multiplicative factor inside the log, which becomes an additive constant once you take it outside the log: log_2(c * t/n) = log_2(c) + log_2(t/n). So this procedure (subtracting a constant from the log values) is designed to remove that offset.

As far as estimating it by considering the median of values coming from each chromosome, I imagine this could be to deal with aneuploidy from cancer data. Even if the total read depths are exactly matched between the two experiments, if the tumor has an extra copy of a particular chromosome, that will throw off all the log ratios (both if you don't correct and if you try to correct naively by total read count). Taking the median across all chromosomes makes the estimate robust, and should give you the right answer if the majority of chromosomes are in equal copy number.
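Here is a small simulation of both points, as a sketch under stated assumptions rather than anything taken from the code in question. The Poisson read-depth model, the 0.5 pseudocount, the chromosome and window counts, and all variable names are hypothetical. For brevity it uses the per-chromosome median instead of the histogram mode.

```python
import numpy as np

rng = np.random.default_rng(0)
n_chrom, n_win = 10, 2000
normal_mean = 30.0   # assumed mean depth per window in the normal sample
depth_factor = 2.0   # tumor sequenced twice as deep overall

# True tumor/normal copy ratio: 1.0 everywhere, except one chromosome with an
# extra copy (3 copies instead of 2).
copy_ratio = np.ones((n_chrom, n_win))
copy_ratio[3, :] = 1.5

normal = rng.poisson(normal_mean, size=(n_chrom, n_win))
tumor = rng.poisson(normal_mean * depth_factor * copy_ratio)

# Pseudocount of 0.5 guards against division by zero (an assumption).
log_ratio = np.log2((tumor + 0.5) / (normal + 0.5))

# Naive correction: ratio of total read counts, which the gained chromosome inflates.
naive_offset = np.log2(tumor.sum() / normal.sum())

# Robust correction: median across chromosomes of a per-chromosome center.
robust_offset = np.median(np.median(log_ratio, axis=1))

neutral = np.delete(np.arange(n_chrom), 3)
print("naive offset :", round(float(naive_offset), 3))
print("robust offset:", round(float(robust_offset), 3))
print("neutral chromosomes after naive correction :",
      round(float(np.median(log_ratio[neutral] - naive_offset)), 3))
print("neutral chromosomes after robust correction:",
      round(float(np.median(log_ratio[neutral] - robust_offset)), 3))
```

The naive offset comes out close to log_2(2.1) rather than log_2(2), so the copy-neutral chromosomes end up slightly below zero, while the median-based offset leaves them near zero.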


Thank you! That makes total sense to me now. I tried to work through an example of what you explained. For instance, let's assume that the tumor data is 60x and the normal 30x. For a region with no aberration (i.e. copy number = 2) we expect the log_2 ratio to be 0. However, in this case we get 1 (pointing to an amplification where there actually is none), since log_2(60/30) = log_2(2 * 1) = log_2(2) + log_2(1) = 1, where 2 is the depth factor and 1 is the true copy ratio. This is unwanted and needs to be corrected.

As for the use of the median, I imagined a case where both samples are 30x and all chromosomes except, say, chr9 are normal (i.e. have no alteration), while the entire chr9 has gained 2 copies, resulting in a log_2 ratio of log_2(4/2) = 1 for chr9. This is a genuine signal, and we do not want to dilute it, for example by averaging over all chromosomes; that is why the median comes to the rescue. Is that right?
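To check the arithmetic, here is a toy version that combines the two examples: 60x tumor vs 30x normal, plus a chr9 that has gained 2 copies. The per-chromosome mode values are made up to match those assumptions, and restricting to 22 autosomes is a simplification.

```python
import numpy as np

chroms = [f"chr{i}" for i in range(1, 23)]

depth_offset = np.log2(60 / 30)  # 1.0, from the depth difference alone
modes = {c: depth_offset for c in chroms}
modes["chr9"] = depth_offset + np.log2(4 / 2)  # 4 copies instead of 2

values = np.array(list(modes.values()))
median_offset = float(np.median(values))
mean_offset = float(values.mean())

after_median = {c: m - median_offset for c, m in modes.items()}
after_mean = {c: m - mean_offset for c, m in modes.items()}

print("median offset:", median_offset, "| mean offset:", round(mean_offset, 4))
print("chr1 after median:", after_median["chr1"], "| after mean:", round(after_mean["chr1"], 4))
print("chr9 after median:", after_median["chr9"], "| after mean:", round(after_mean["chr9"], 4))
```

With the median, chr1 lands exactly on 0 and chr9 keeps its full log_2 ratio of 1. With the mean, every normal chromosome is pushed slightly below 0 (by 1/22) and the chr9 signal shrinks by the same amount.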


Yes, that all sounds right to me.