Can someone give me an explanation of how sliding windows are used for CNV analysis?
i.e., suppose I'm analyzing CNVs for chrI of S. cerevisiae. I create a pileup, take the read depth at every base position, and then divide it by the average read depth. What would I do with a sliding window (e.g., of window size 100bp) to make this more accurate?
When you are looking for copy number changes, you are looking for regions of the genome that have a different number of reads than expected. If one of the chromosomes has a 100bp deletion, you would expect half as many reads in that region as in the surrounding region (if the organism is diploid). If there is an amplification/repeat, there would be more reads in that region than in the surrounding region.
A simple way to figure out whether there are changes in coverage (the number of reads overlapping a region of the genome) is to split the genome into bins and count how many reads fall in each bin. A change in this number would be an indicator of a copy number alteration in that bin. If you have a 100bp deletion but your bin size is 500,000 bp, the reduced number of reads in that bin would be much harder to detect than if your bin size was 500 bp. Thus, a smaller 'window' is more sensitive when you are looking for copy number changes.
However, keep in mind that smaller windows are more computationally intensive (more coverage calculations have to be done and more numbers have to be stored). Overlapping bins are sometimes also called sliding windows, because you can think of it as sliding a 'window' along the genome and doing a calculation at each position.
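To make the bin-size point concrete, here is a minimal sketch in Python. The function name `count_reads_per_bin` and the toy data are made up for illustration: reads are assigned to a bin by their start position only, and the "deletion" is simulated by dropping every other read in a 100bp stretch. Real tools work from alignments and also correct for things like GC content and mappability, which this ignores.

```python
def count_reads_per_bin(read_starts, chrom_length, bin_size):
    """Count reads whose start position falls in each fixed-size bin."""
    n_bins = (chrom_length + bin_size - 1) // bin_size  # ceiling division
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < chrom_length:
            counts[pos // bin_size] += 1
    return counts


# Toy data (hypothetical): a 2,000 bp chromosome with one read starting
# every 10 bp, except in 1000-1099 where every other read is dropped,
# mimicking a 100 bp deletion on one copy in a diploid.
read_starts = [
    p for p in range(0, 2000, 10)
    if not (1000 <= p < 1100 and (p // 10) % 2 == 1)
]

small_bins = count_reads_per_bin(read_starts, 2000, 100)
large_bins = count_reads_per_bin(read_starts, 2000, 1000)

print(small_bins[9:12])  # bins around the deletion at 100 bp resolution
print(large_bins)        # the same data at 1,000 bp resolution
```

With 100bp bins the affected bin drops to half of its neighbours; with 1,000bp bins the same deletion only shows up as a 5% dip, which would be easy to lose in real noise.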
A followup question: suppose I used a 100bp window to calculate CNV, and I want to create a plot of CNV in chrI. With a static window, I can do this very easily, i.e., divide up the chromosome into 100bp regions, perform the calculations, and use the output to create a plot. How does one go about doing this with sliding windows? I know many pieces of software, such as ReadDepth and cnvnator use a sliding window for just this purpose--but I'm not sure how they assimilate the information into a single histogram.
The way I think about it is that you take a point and create a window of +/- some distance around it. Then you pick another point a bit further along and create a window of the same +/- distance around that point. There are two parameters here: the distance between points and the +/- amount.
When you split the genome into 100bp non-overlapping windows, what you are doing is setting points 100bp apart with a +/- of 50bp. If you instead placed the points 50bp apart, the windows would 'overlap': 50bp of each window would be shared with the window before it and the other 50bp with the window after it.
For the plot, I assume you would be plotting coverage on the Y axis and location on the X axis. For each window, just use the position of the point you built the window around as the X value and plot that window's coverage there. One thing to note, though, is that this plot will not directly tell you what the window size (the +/- distance) was or how much overlap there was.
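Here is a rough sketch of the two-parameter idea in Python. The names `sliding_window_means`, `step`, and `half_width` are my own, and I am assuming you already have a per-base depth list (e.g., from a pileup). This is not how ReadDepth or CNVnator implement it, just the bare concept.

```python
def sliding_window_means(depth, step, half_width):
    """Return (center, mean depth) for windows
    [center - half_width, center + half_width), spaced `step` bp apart."""
    results = []
    for center in range(half_width, len(depth) - half_width + 1, step):
        window = depth[center - half_width:center + half_width]
        results.append((center, sum(window) / len(window)))
    return results


# Toy per-base depth (hypothetical): 400 bp at depth 20,
# with depth 10 across positions 200-299.
depth = [20] * 200 + [10] * 100 + [20] * 100

# Points 100 bp apart, +/- 50 bp: ordinary non-overlapping 100 bp bins.
static = sliding_window_means(depth, step=100, half_width=50)

# Points 50 bp apart, +/- 50 bp: each window shares 50 bp with each neighbour.
sliding = sliding_window_means(depth, step=50, half_width=50)

print(static)
print(sliding)
```

The only difference between the 'static' and 'sliding' cases is `step`. With overlap you get intermediate values (15 here) for windows that straddle the edge of the low-coverage region.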
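As a sketch of how the plot data could be put together (again in Python, with a made-up helper `window_plot_series` and toy depth values): one X value per window center, and the window's mean depth divided by the chromosome-wide mean as the Y value, which is the normalization described in the question. Since the plot itself hides the window size and overlap, the sketch also builds a label recording both.

```python
def window_plot_series(depth, step, half_width):
    """Return x (window centers), y (window mean / overall mean depth),
    and a label recording the window size and step."""
    overall_mean = sum(depth) / len(depth)
    xs, ys = [], []
    for center in range(half_width, len(depth) - half_width + 1, step):
        window = depth[center - half_width:center + half_width]
        xs.append(center)
        ys.append((sum(window) / len(window)) / overall_mean)
    label = f"window = {2 * half_width} bp, step = {step} bp"
    return xs, ys, label


# Toy per-base depth (hypothetical), with reduced depth at 200-299.
depth = [20] * 200 + [10] * 100 + [20] * 100
xs, ys, label = window_plot_series(depth, step=50, half_width=50)

print(label)
for x, y in zip(xs, ys):
    print(f"{x}\t{y:.2f}")
```

You can hand `xs` and `ys` to whatever plotting tool you like (one point or bar per window center) and put `label` in the title or legend so the reader knows the window size and step.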
Thanks! This was really detailed and helpful.