Question

Choosing Window Size (Sliding Window Approach) During CNV Analysis By Read-Depth Approach

5

12.5 years ago

Vikas Bansal ★ 2.4k

Dear all,

There are some very good tools for CNV analysis using the read-depth approach (ReadDepth, mrCaNaVar, etc.). They use a sliding window, and the default parameters (window size and slide length) are tuned for large genomes such as human, where chromosomes have millions of base pairs. What if we have a different reference genome (not human) that is either huge (trillions of base pairs) or very small (100 or 1000 bp per chromosome)? What should the window size and the slide length be in that case? What is the correct method for choosing them?

Thanks and Best regards,

Vikas

cnv analysis reference • 12k views

ADD COMMENT • link updated 12.5 years ago by Chris Miller 22k • written 12.5 years ago by Vikas Bansal ★ 2.4k

0

Good question but off the top of my head, I can't think of a trillion base genome or a 100/1000 bp chromosome for any organism :)

ADD REPLY • link 12.5 years ago by Neilfws 49k

0

Yes, I know the genome sizes I assumed are extremely large and extremely small, but you never know; maybe we will find this kind of genome some day.

ADD REPLY • link 12.5 years ago by Vikas Bansal ★ 2.4k

score 9 · Answer 1 · 2012-02-28

I think it depends on overall coverage. If you have a lot of reads, you can set the windows quite small; if you have few reads, you'll have to allow large windows. In the case of chromosomes (or contigs) of only 100-1000 bp, you need many reads. Yoon et al (2009) say the distribution is like a Poisson with overdispersion. I find that the overdispersion is quite strong, so you can't really call it a Poisson distribution. Furthermore, mappability strongly influences the number of reads per window. In our paper we say that "From our experience in several different samples, selecting window size in which there are 30–180 read counts per window on average strikes a reasonable balance between error variability and bias of CNA"

Basically, we have observed that with fewer than 30 reads per window it becomes quite common to get windows with no reads, and you can't tell whether that is by chance (and low mappability) or because of actual copy loss. You "hit the bottom" and lose information about that window. On the other hand, going above 180 reads per window doesn't gain you much and only reduces your resolution. Still, if you have very high coverage, you can go beyond that.

In fact, the CNAnorm script that converts a BAM file to windows (bam2window.pl) lets you set either the window size or the average number of reads per window in the sample with the fewest reads. In the latter case, it calculates the right window size from the sum of the chromosome/contig lengths reported in the header of the SAM/BAM files.
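As a rough illustration of that calculation, here is a minimal Python sketch. It is not the actual bam2window.pl code; the function name and the example numbers are made up, and it ignores mappability and unmapped regions.

```python
import math


def window_size_for_target(total_reads, contig_lengths, target_reads_per_window):
    """Window size (bp) that gives about `target_reads_per_window` reads on average.

    `contig_lengths` stands in for the lengths listed in the SAM/BAM header.
    Assumes reads are spread evenly over the genome, which real data are not.
    """
    genome_length = sum(contig_lengths)
    if total_reads <= 0 or genome_length <= 0:
        raise ValueError("need a positive read count and genome length")
    # average reads per bp = total_reads / genome_length, so
    # window = target / (reads per bp), rounded up to a whole base pair
    return max(1, math.ceil(target_reads_per_window * genome_length / total_reads))


if __name__ == "__main__":
    # Hypothetical sample: 1 million mapped reads on two contigs totalling 5 Mb
    print(window_size_for_target(1_000_000, [3_000_000, 2_000_000], 30))
```

With several samples you would pass the read count of the sample with the fewest reads, so that every sample reaches at least the target on average.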

Also, consider that with very short chromosomes/contigs you might see edge effects, as a considerable fraction of the windows will be smaller than the others.
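To get a feel for how much this matters, here is a small Python sketch that tiles contigs with non-overlapping windows and reports the fraction that end up truncated at a contig end. The helper names and the example lengths are hypothetical.

```python
def tile_windows(contig_length, window_size):
    """Non-overlapping (start, end) windows; the last one may be shorter."""
    return [
        (start, min(start + window_size, contig_length))
        for start in range(0, contig_length, window_size)
    ]


def truncated_fraction(contig_lengths, window_size):
    """Fraction of all windows that are shorter than `window_size`."""
    total = 0
    short = 0
    for length in contig_lengths:
        for start, end in tile_windows(length, window_size):
            total += 1
            if end - start < window_size:
                short += 1
    return short / total if total else 0.0


if __name__ == "__main__":
    # Ten 1 kb contigs with 300 bp windows versus one long chromosome
    print(truncated_fraction([1000] * 10, 300))
    print(truncated_fraction([3_000_100], 300))
```

On many short contigs a quarter of the windows can be truncated, while on a single long chromosome the one short window at the end is negligible.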

score 4 · Answer 2 · 2012-02-28

4

12.5 years ago

Pascal ★ 1.5k

Very good question indeed. One usually uses 100 bp non-overlapping windows. Please have a look at "Sensitive and accurate detection of copy number variants using read depth of coverage" (Yoon et al.); it gives a good explanation of why they used this length.

1

It really depends on coverage! 100bp probably assumes "typical" 30X coverage, but the beauty of NextGen is that you can tune the coverage. CNA works well with 0.05X coverage

ADD REPLY • link 12.5 years ago by Stefano Berri 4.4k
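The link between coverage and reads per window can be sketched in Python as follows. This assumes reads of a fixed length that are counted once each and spread evenly (the read length is my assumption, it is not given in the thread), and it ignores mappability.

```python
def expected_reads_per_window(coverage, window_size, read_length):
    """Average reads per window: (coverage / read_length) reads start per bp."""
    return coverage * window_size / read_length


def window_for_target_reads(coverage, read_length, target_reads):
    """Window size (bp) needed to average `target_reads` reads per window."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return target_reads * read_length / coverage


if __name__ == "__main__":
    # Hypothetical 100 bp reads: a 100 bp window at 30X, then 30 reads at 0.05X
    print(expected_reads_per_window(30, 100, 100))
    print(window_for_target_reads(0.05, 100, 30))
```

Under these assumptions a 100 bp window at 30X averages about 30 reads, while at 0.05X you need windows of roughly 60 kb to reach the same count.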

score 4 · Answer 3 · 2012-02-28

In our readDepth package, we use a different approach for calculating window sizes that seeks to explicitly control the false discovery rate. The bullet point form is this:

- Though we expect a Poisson distribution, there is significant overdispersion in real-world data, so we model it using a negative binomial distribution.
- Once you have the parameters of this distribution, you can model regions of 1x, 2x, and 3x copy number and determine optimally separating thresholds.
- At the tails of your distributions, there will be some bins from the 2x distribution that are called as 3x (and vice versa). Those are your false positives.
- By iteratively adjusting the size of your windows, you can determine a window size that has the highest resolution while still remaining under a specified FDR. This size will depend on the depth of sequencing (higher coverage = smaller windows).
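The steps above can be sketched in Python roughly as follows. This is a simplified illustration and not the readDepth implementation: it only looks at 2x versus 3x, it uses the larger of the two tail miscall rates as a stand-in for a proper FDR, and it assumes a constant variance-to-mean ratio (`dispersion`) for the negative binomial. All function names and parameter values are hypothetical.

```python
import math


def nb_pmf(k, mean, dispersion):
    """Negative binomial pmf with variance = dispersion * mean (dispersion > 1)."""
    r = mean / (dispersion - 1.0)  # NB size parameter
    p = 1.0 / dispersion           # NB success probability
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def gain_miscall_rate(mean_2x, dispersion):
    """Worst of: 2x bins called 3x, or 3x bins called 2x, at the crossover count."""
    if mean_2x <= 0 or dispersion <= 1:
        raise ValueError("need mean_2x > 0 and dispersion > 1")
    mean_3x = 1.5 * mean_2x
    upper = int(mean_3x + 20 * math.sqrt(dispersion * mean_3x)) + 10
    pmf2 = [nb_pmf(k, mean_2x, dispersion) for k in range(upper + 1)]
    pmf3 = [nb_pmf(k, mean_3x, dispersion) for k in range(upper + 1)]
    # Threshold: first count at or above the 2x mean where 3x is at least as likely
    start = math.ceil(mean_2x)
    threshold = next((k for k in range(start, upper + 1) if pmf3[k] >= pmf2[k]), upper)
    false_gain = sum(pmf2[threshold:])   # 2x bins at or above the threshold
    missed_gain = sum(pmf3[:threshold])  # 3x bins below the threshold
    return max(false_gain, missed_gain)


def smallest_window(reads_per_bp, dispersion, max_error, step=100, max_window=1_000_000):
    """Grow the window in `step` bp increments until the miscall rate is acceptable."""
    for window in range(step, max_window + 1, step):
        if gain_miscall_rate(reads_per_bp * window, dispersion) <= max_error:
            return window
    return None  # no window up to max_window was good enough


if __name__ == "__main__":
    # Hypothetical: 0.3 reads per bp versus 0.03 reads per bp, dispersion 3
    print(smallest_window(0.3, 3.0, 0.01))
    print(smallest_window(0.03, 3.0, 0.01))
```

The lower-coverage run should come back with a much larger window, which is the "higher coverage = smaller windows" behaviour described in the last bullet.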

Our tests show that 100bp windows may be too small, and that something more like 1-5000bp may be better for a 30x genome. If you have even less coverage, say, from a draft genome, that size will go up considerably.

If you're interested in the gritty details, you can read the paper (open access - PLoS).

score 3 · Answer 4 · 2012-02-28

3

12.5 years ago

ALchEmiXt ★ 1.9k

A general rule of thumb in sliding-window approaches is that the chosen window size should approach the minimal size of the features you want to be able to detect.

So a smaller window lets smaller segments become significantly different sooner than a larger window would... But overall coverage indeed matters as well. For bacterial genomes (2-4 Mb) and plasmids (50-100 kb) we handle it this way, and it works fine for detecting our major features of interest (for instance transposons/repeats).

2

There's a trade-off, though. The smaller your windows, the noisier your data.

ADD REPLY • link 12.5 years ago by Chris Miller 22k
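To put a number on that trade-off, here is a small Python sketch of the relative noise (standard deviation divided by mean) of a window's read count. It assumes Poisson-like counts with an optional variance-to-mean ratio `dispersion`; that model and the function name are my assumptions, not something stated in this thread.

```python
import math


def relative_noise(mean_reads, dispersion=1.0):
    """Coefficient of variation of the read count when variance = dispersion * mean."""
    if mean_reads <= 0 or dispersion < 1:
        raise ValueError("need mean_reads > 0 and dispersion >= 1")
    return math.sqrt(dispersion / mean_reads)


if __name__ == "__main__":
    # Halving the window halves the mean count and raises the relative noise
    for mean_reads in (200, 100, 50, 25):
        print(mean_reads, round(relative_noise(mean_reads, dispersion=3.0), 3))
```

Because the relative noise scales with one over the square root of the mean count, making windows four times smaller roughly doubles it.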