Question

Overlapping Windows

2

12.4 years ago

Random ▴ 160

Suppose I want to plot a genomic region on two axes, say depth of coverage on x and GC content on y. Since the region is big, I have to do it by windows, where each window corresponds to the mean depth of coverage over 10,000 base pairs.

Suppose I then create another set of 10 kbp windows, this time starting at position 5k, so that each new window overlaps two neighbouring old windows by 5 kbp each. What kind of transformation should I apply to my data, since each region is effectively represented twice? Just normalize it?

Can you refer me to papers that use this kind of region-wise exploration and show different ways to best capture the true variation within a genomic region while minimizing the errors that overlapping windows may introduce (such as signal-to-noise approaches)?

Thanks

coverage genomics coordinates • 5.1k views

ADD COMMENT • link updated 12.4 years ago by 2184687-1231-83- ★ 5.1k • written 12.4 years ago by Random ▴ 160

0

So are you making a 3d graph? Where x and y are gc content, coverage and z is every 10kb? I don't quite get how the graph is set up.

ADD REPLY • link 12.4 years ago by Damian Kao 16k

0

Maybe you mean to plot the coverage/gc% ratio per base? Or indeed a 3D graph of coverage, GC, and position?

ADD REPLY • link 12.4 years ago by ALchEmiXt ★ 1.9k

0

I meant the coverage/gc% ratio per base. Plotting all 10 million points would give a plot that is slow to render and full of indistinguishable peaks, so instead I wanted to smooth it by reducing each chunk of 10k points to a single one, and so see where the regions with excess variation actually are.

ADD REPLY • link 12.4 years ago by Random ▴ 160

score 2 · Answer 1 · 2011-12-08

2

12.4 years ago

2184687-1231-83- ★ 5.1k

What you suggest sounds like the sliding window approach, which has been used for inspecting variation before. Here is a reference:

Rozas J, Rozas R: DnaSP, DNA sequence polymorphism: an interactive program for estimating Population Genetics parameters from DNA sequence data.
Comput Appl Biosci 1995, 11:621-625.

We wrote a tool for genome-wide analysis of polymorphisms a while ago, VariScan, which implements two kinds of sliding windows: one that is fixed for genomic stretches, and one that fixes the number of polymorphisms per window:

http://www.ub.edu/softevol/variscan/
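
As a rough illustration of the first kind (fixed-width windows), and not VariScan's actual code, here is a minimal Python sketch. The function name `window_means` and its parameters `width` and `step` are made up for this example. With `width=10000` and `step=5000` it gives the layout described in the question, where each window overlaps its neighbours by 5 kbp.

```python
import numpy as np

def window_means(values, width, step):
    """Mean of `values` in fixed-width windows whose starts are `step` apart.

    Hypothetical helper for illustration only. Windows that would run past
    the end of the array are dropped (an assumption, not part of the answer).
    Returns (starts, means), where starts are 0-based window start positions.
    """
    values = np.asarray(values, dtype=float)
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    starts = np.arange(0, values.size - width + 1, step)
    # Prefix sums let every window sum be read off in constant time.
    csum = np.concatenate(([0.0], np.cumsum(values)))
    means = (csum[starts + width] - csum[starts]) / width
    return starts, means
```

Each window mean is computed independently from its own `width` values, so the overlap does not change the scale of any single point. What it does change is that neighbouring points share half of their data and are therefore correlated.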

There may be other newer implementations of the same concept out there these days.

ADD COMMENT • link 12.4 years ago by 2184687-1231-83- ★ 5.1k

0

Incidentally, what made me ask this question was seeing Rozas question a PhD candidate, at his thesis defense, about exactly how he had constructed his sliding window approach, since the candidate himself wasn't fully aware of the normalization issue.

ADD REPLY • link 12.4 years ago by Random ▴ 160

score 1 · Answer 2 · 2011-12-08

I am not entirely sure if this answers your question, but I think your best option is a sliding window approach. For every base x you calculate the average coverage over the region from x - 1/2 w to x + 1/2 w, where w is the window size. This is how we usually assess overrepresented coverage sections in our genome sequencing projects. It filters the noise and, depending on the window size, lets you detect significant differences (i.e. deviations of more than 2 SD) for features of a size roughly comparable to the chosen window size. Our formula is basically like this:

$$u(i) = \frac{1}{N_{window}+1} \sum_{m=-N_{window}/2}^{+N_{window}/2} X_{i+m}$$

where u(i) is the average of the window at position i
N_window is the chosen window size (since the window actually extends 1/2 before and 1/2 after position i, we need to add 1 for position i itself)
the sum runs over the coverage (= X) from position i minus 1/2 window size to i plus 1/2 window size (the offset m runs from -1/2 N_window to +1/2 N_window)

So this deals only with the coverage issue (y-axis). You can probably do the same for GC content and plot the two against each other, for instance in a 3D graph over position i.
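
The windowed mean described above can be sketched in Python roughly as follows. This is an illustration, not the code we actually run: the names `centered_window_mean` and `flag_deviations` are invented here, positions too close to the ends of a linear sequence are simply set to NaN, and the 2 SD threshold is taken relative to the mean and SD of the smoothed values, which is only one possible reading of "2 SD".

```python
import numpy as np

def centered_window_mean(x, n_window):
    """u(i): mean of x over positions i - n_window/2 .. i + n_window/2.

    Each window holds n_window + 1 values. Positions closer than
    n_window/2 to either end get NaN (an assumption for linear sequences).
    """
    if n_window < 0 or n_window % 2 != 0:
        raise ValueError("n_window must be a non-negative even integer")
    x = np.asarray(x, dtype=float)
    half = n_window // 2
    out = np.full(x.shape, np.nan)
    if x.size < n_window + 1:
        return out
    # Prefix sums: sum over [i - half, i + half] = csum[i + half + 1] - csum[i - half]
    csum = np.concatenate(([0.0], np.cumsum(x)))
    sums = csum[n_window + 1:] - csum[:x.size - n_window]
    out[half:x.size - half] = sums / (n_window + 1)
    return out

def flag_deviations(u, n_sd=2.0):
    """Boolean mask of positions whose value is more than n_sd SDs from the mean.

    Mean and SD ignore NaN entries; NaN positions are never flagged.
    """
    u = np.asarray(u, dtype=float)
    mu = np.nanmean(u)
    sd = np.nanstd(u)
    return np.abs(u - mu) > n_sd * sd
```

The same function can be applied to a per-base GC indicator (1 for G or C, 0 otherwise) to get windowed GC content on the same positions.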

PS: sorry for complex explanation. Just simply cannot fit a proper Sigma function in here...

PPS: please note that for circular genomes you need to make sure the window wraps around to the other end when position i is within 1/2 window size of the start or end!
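
A minimal sketch of that wrap-around for a circular genome, again in Python and again only as an illustration (the function name `circular_window_mean` is made up here). It pads the array with values from the opposite end using `np.pad` in `wrap` mode, then applies the same N_window + 1 averaging at every position.

```python
import numpy as np

def circular_window_mean(x, n_window):
    """Centered window mean on a circular sequence.

    Windows near the start or end wrap around to the other end, so every
    position gets a value averaged over n_window + 1 positions.
    """
    x = np.asarray(x, dtype=float)
    if n_window < 0 or n_window % 2 != 0:
        raise ValueError("n_window must be a non-negative even integer")
    if n_window + 1 > x.size:
        raise ValueError("window is longer than the sequence")
    half = n_window // 2
    # Prepend the last `half` values and append the first `half` values.
    padded = np.pad(x, half, mode="wrap")
    csum = np.concatenate(([0.0], np.cumsum(padded)))
    # Original position i sits at padded index i + half, so its window
    # covers padded indices i .. i + n_window.
    sums = csum[n_window + 1:] - csum[:x.size]
    return sums / (n_window + 1)
```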