Question: Choosing window size (sliding window approach) during CNV analysis by read-depth approach

Vikas Bansal • 2.4k (Berlin, Germany) wrote 8.7 years ago:

Dear all,

There are some very good tools for CNV analysis using the read-depth approach (ReadDepth, mrCaNaVar etc.). They use a sliding window, and their default parameters (window size and step size) are set for large genomes like human, where chromosomes have millions of base pairs. I would like to know: if we have a different reference genome (not human), one that is much larger (say trillions of base pairs) or very small (only 100 or 1000 bp per chromosome), what should the window size and the step of the sliding window be? What is the correct method for choosing them?

Thanks and Best regards,

Vikas

analysis reference cnv • 8.4k views
written 8.7 years ago by Vikas Bansal • 2.4k

Good question, but off the top of my head I can't think of a trillion-base genome or a 100/1000 bp chromosome for any organism :)

written 8.7 years ago by Neilfws • 49k

Yes, I know the genome sizes I assumed are too large and too small, but we never know; maybe we will find such a genome some day.

written 8.7 years ago by Vikas Bansal • 2.4k
Stefano Berri • 4.1k (Cambridge, UK) wrote 8.7 years ago:

I think it depends on overall coverage. If you have many reads, you can make the windows quite small; if you have few reads, you will have to allow large windows. For chromosomes (or contigs) of only 100-1000 bp, you therefore need many reads. Yoon et al. (2009) describe the distribution as Poisson-like with overdispersion. I find that the overdispersion is quite strong, so you cannot really treat it as a Poisson distribution. Furthermore, mappability strongly influences the number of reads per window. In our paper we say: "From our experience in several different samples, selecting window size in which there are 30–180 read counts per window on average strikes a reasonable balance between error variability and bias of CNA".

Basically, we have observed that with fewer than 30 reads per window it becomes quite common to have windows with no reads at all, and you cannot tell whether that is by chance (and low mappability) or because of an actual copy loss. You "hit the bottom" and lose information about that window. On the other hand, going above 180 reads per window does not gain you much but does reduce your resolution. Still, if you have very high coverage, you can go beyond that.

In fact, the CNAnorm script that converts a BAM file to windows (bam2window.pl) lets you set either the window size OR the average number of reads per window in the sample with the fewest reads. It then calculates the right window size from the sum of the chromosome/contig lengths reported in the header of the SAM/BAM file.
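
For illustration, here is a minimal sketch in Python of how a window size can be derived from a target read count per window. The function name and numbers are made up for the example; this is not the actual bam2window.pl interface.

    # Illustrative sketch only (not the actual bam2window.pl code): derive a window
    # size from a target mean read count per window, given the total genome length
    # (sum of contig lengths from the BAM header) and the total number of mapped reads.

    def window_size_for_target(genome_length_bp, mapped_reads, target_reads_per_window=100):
        """Window size (bp) that yields roughly target_reads_per_window reads per window."""
        reads_per_bp = mapped_reads / genome_length_bp
        return int(round(target_reads_per_window / reads_per_bp))

    # Example: 60 million mapped reads on a 3 Gb genome, aiming for the middle of
    # the 30-180 reads/window range mentioned above.
    print(window_size_for_target(3_000_000_000, 60_000_000, target_reads_per_window=100))
    # -> 5000 (bp)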

Also, consider that with very short chromosomes/contigs you might see some edge effect, as a considerable fraction of windows will be smaller than the others.

written 8.7 years ago by Stefano Berri • 4.1k

Thanks a lot for your reply. This line really helps: "selecting window size in which there are 30–180 read counts per window on average". But I did not understand why, for small chromosomes, I might have an edge effect. Can you please explain this a little bit?

written 8.7 years ago by Vikas Bansal • 2.4k

If you have a chromosome as tiny as, say, 210 bp, and your windows are 100 bp, then 1 out of 3 windows is shorter, with possible side effects. This is not a problem if it is 1 window out of a thousand; who cares about the very last window. But if such a truncated last window happens over and over, it might become a problem.
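
A tiny hypothetical Python sketch of what I mean, just to illustrate the counting:

    # Hypothetical illustration of the edge effect: tile contigs with fixed-size
    # windows and count how many windows end up truncated at the contig end.

    def tile(contig_length, window=100):
        """(start, end) windows covering a contig; the final window may be shorter."""
        return [(s, min(s + window, contig_length)) for s in range(0, contig_length, window)]

    for length in (210, 100_000):
        windows = tile(length)
        short = sum(1 for s, e in windows if e - s < 100)
        print(length, len(windows), short, round(short / len(windows), 3))
    # 210 bp contig:    3 windows, 1 truncated (~33%)
    # 100000 bp contig: 1000 windows, 0 truncated (0%)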

written 8.7 years ago by Stefano Berri • 4.1k
Chris Miller • 21k (Washington University in St. Louis, MO) wrote 8.7 years ago:

In our readDepth package, we use a different approach to calculating window size, one that seeks to explicitly control the false discovery rate. In bullet-point form:

  1. Though we expect a Poisson distribution, there is significant overdispersion in real-world data, so we model the read counts with a negative binomial distribution.

  2. Once you have the parameters of this distribution, you can model regions of 1x, 2x, and 3x copy number and determine the thresholds that optimally separate them.

  3. At the tails of these distributions there will be some bins from the 2x distribution that are called as 3x (and vice versa). Those are your false positives.

  4. By iteratively adjusting the size of your windows, you can determine the window size that gives the highest resolution while still remaining under a specified FDR. This size depends on the depth of sequencing (higher coverage = smaller windows); a toy sketch of this iteration follows below.

Our tests show that 100 bp windows may be too small, and that something more like 1-5000 bp may be better for a 30x genome. If you have even less coverage, from, say, a draft genome, that size will go up considerably.
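
To make this concrete, here is a rough toy sketch in Python of the grow-the-window-until-the-FDR-is-met step. The numbers, the fixed overdispersion factor, and the simple midpoint threshold are assumptions for illustration; this is not the readDepth code itself.

    # A rough, hypothetical sketch of the FDR-driven idea above (not the actual
    # readDepth implementation): model per-window read counts at 2x and 3x copy
    # number as overdispersed negative binomials, put a threshold between them,
    # and grow the window until the misclassification rate meets a target FDR.
    from scipy.stats import nbinom

    def nb(mean, phi):
        """Negative binomial with the given mean and variance phi * mean (phi > 1)."""
        size = mean / (phi - 1.0)          # gives var = mean + mean**2/size = phi * mean
        return nbinom(size, size / (size + mean))

    def misclassification(window_bp, reads_per_bp, phi=3.0):
        mu2 = window_bp * reads_per_bp     # expected reads in a 2x (normal diploid) window
        mu3 = 1.5 * mu2                    # expected reads in a 3x (single-copy gain) window
        threshold = (mu2 + mu3) / 2        # simple midpoint threshold between the two states
        d2, d3 = nb(mu2, phi), nb(mu3, phi)
        return d2.sf(threshold) + d3.cdf(threshold)   # 2x called 3x plus 3x called 2x

    window, target_fdr = 100, 0.01
    while misclassification(window, reads_per_bp=0.3) > target_fdr:   # ~30x with 100 bp reads
        window += 100
    print(window)   # smallest window size (bp) meeting the FDR target in this toy model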

If you're interested in the gritty details, you can read the paper (open access - PLoS).

written 8.7 years ago by Chris Miller • 21k
ALchEmiXt • 1.9k (The Netherlands) wrote 8.7 years ago:

A general rule of thumb in sliding-window approaches is that the chosen window size should approach the minimal size of the features you want to be able to detect.

So a smaller window lets smaller segments reach statistical significance sooner than a larger window would, but overall coverage indeed matters as well. For bacterial genomes (2-4 Mb) and plasmids (50-100 kb) we handle it this way, and it works fine for detecting our major features of interest (for instance transposons/repeats); a back-of-the-envelope check is sketched below.
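
As a hypothetical illustration (Python, made-up numbers), one can combine this rule of thumb with the 30-180 reads/window guideline from the other answer:

    # Hypothetical back-of-the-envelope check combining the two rules of thumb from
    # this thread: pick a window no larger than the smallest feature you care about,
    # then verify that it still captures enough reads (roughly 30-180 per window).

    def expected_reads_per_window(window_bp, fold_coverage, read_length=100):
        """Expected number of reads starting in a window at the given fold coverage."""
        return window_bp * fold_coverage / read_length

    # Example: a bacterial genome sequenced to 50x, looking for ~1 kb transposons.
    window_bp = 1000                                  # window ~ smallest feature of interest
    reads = expected_reads_per_window(window_bp, fold_coverage=50)
    print(reads, 30 <= reads <= 180)                  # 500.0 False -> room to shrink the window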

written 8.7 years ago by ALchEmiXt • 1.9k

There's a trade-off, though: the smaller your windows, the noisier your data.

written 8.7 years ago by Chris Miller • 21k
Pascal • 1.5k (Barcelona) wrote 8.7 years ago:

Very good question indeed. One usually uses 100 bp windows (non-overlapping). Please have a look at "Sensitive and accurate detection of copy number variants using read depth of coverage" (Yoon et al.); it gives a good explanation of why they chose this length.

written 8.7 years ago by Pascal • 1.5k

It really depends on coverage! 100 bp probably assumes "typical" 30x coverage, but the beauty of NextGen is that you can tune the coverage. CNA works well with 0.05x coverage.

written 8.7 years ago by Stefano Berri • 4.1k