Question: Overlapping Windows
8.0 years ago, Random (160) wrote:

Suppose I want to plot a genomic region with two axes, say depth of coverage on x and GC content on y. Since the region is big, I have to do it by windows, where each window corresponds to the mean depth of coverage over 10,000 base pairs.

If I then create another set of 10 kbp windows, this time starting at position 5k, so that each new window overlaps two neighbouring old windows by 5 kbp, what kind of transformation should I apply to my data, since each region is effectively being represented twice? Just normalize it?
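A minimal sketch of the two window layouts described above, in plain Python. The function name `window_means` and the toy `depth` values are made up for illustration; a real run would use `size=10000` with `step=10000` (tiled) or `step=5000` (half-overlapping). Each window mean is already divided by its own length, so the overlap does not inflate the values themselves; what changes is that neighbouring overlapping windows share half of their data and are therefore not independent.

```python
def window_means(values, size, step):
    """Return (start, mean) for every full window of `size` values,
    with window starts spaced `step` positions apart."""
    # Prefix sums make each window total a single subtraction.
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    means = []
    for start in range(0, len(values) - size + 1, step):
        total = prefix[start + size] - prefix[start]
        means.append((start, total / size))
    return means


# Toy per-base depth; hypothetical data, for illustration only.
depth = [1, 1, 1, 1, 5, 5, 5, 5, 1, 1, 1, 1]

tiled = window_means(depth, size=4, step=4)        # non-overlapping windows
overlapping = window_means(depth, size=4, step=2)  # each window overlaps its neighbours by half
```

In this toy example the overlapping layout adds intermediate windows that straddle the boundaries of the tiled ones, which is why the transition in depth looks smoother there.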

Can you refer me to papers that use this kind of region-wise exploration and show different ways to best capture the true variation within a genomic region, while minimizing the errors that overlapping windows may introduce (such as signal-to-noise approaches)?

Thanks

coordinates coverage genomics • 3.4k views
modified 8.0 years ago by 2184687-1231-83- (5.0k) • written 8.0 years ago by Random (160)

So are you making a 3D graph, where x and y are GC content and coverage, and z is every 10 kb? I don't quite get how the graph is set up.

written 8.0 years ago by Damian Kao (15k)

Maybe you mean to plot the coverage/gc% ratio per base? Or indeed a 3D graph of coverage, GC, and position?

written 8.0 years ago by ALchEmiXt (1.9k)

I meant the coverage/GC% ratio per base. But since plotting 10 million points would give a plot that is slow to render and full of indistinguishable peaks, I wanted to smooth it instead, reducing each chunk of 10k points to a single one, to see where the regions with excess variation actually are.

written 8.0 years ago by Random (160)
8.0 years ago, 2184687-1231-83- (5.0k) wrote:

What you suggest sounds like the sliding window approach, which has been used for inspecting variation before. Here is a reference:

Rozas J, Rozas R: DnaSP, DNA sequence polymorphism: an interactive program for estimating Population Genetics parameters from DNA sequence data.
Comput Appl Biosci 1995, 11:621-625.

We wrote a tool for genome-wide analysis of polymorphisms a while ago, VariScan, which implements two kinds of sliding windows: one that is fixed for genomic stretches, and one that fixes the number of polymorphisms per window:

http://www.ub.edu/softevol/variscan/

There may be other newer implementations of the same concept out there these days.

modified 8.0 years ago • written 8.0 years ago by 2184687-1231-83- (5.0k)

Incidentally, what made me ask this question was seeing Rozas ask a PhD candidate, at his thesis defense, exactly how he had constructed his sliding windows; the candidate himself wasn't fully aware of the normalization issue.

written 8.0 years ago by Random (160)
8.0 years ago, ALchEmiXt (1.9k, The Netherlands) wrote:

I am not entirely sure this answers your question, but I think your best option is a sliding window approach. For every base x you calculate the average coverage of the region from x - 1/2 w to x + 1/2 w, where w is the window size. This is how we usually assess overrepresented coverage sections in our genome sequencing projects. It filters the noise and, depending on the window size, lets you detect significant differences (i.e. deviating by more than 2 SD) for features of a size roughly comparable to the chosen window size. Our formula is basically this:

u(i) = 1/(Nwindow+1) * SUM(Xi+m), for m = -Nwindow/2 to +Nwindow/2

  • where u(i) is the average of the window centred at position i
  • Nwindow is the chosen window size (since the window is actually 1/2 before and 1/2 after position i, we need to add 1 for position i itself)
  • SUM(Xi+m) is the sum of coverage (= X) from position i minus 1/2 window size to i plus 1/2 window size (the offset m runs from -1/2 Nwindow to +1/2 Nwindow)
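The formula and its terms can be sketched in plain Python roughly as follows. This is only an illustration under stated assumptions, not the code we actually run: the name `centered_mean` is made up, Nwindow is assumed to be even so the window is symmetric around i, and positions closer than 1/2 window size to either end are simply left as `None`.

```python
def centered_mean(x, n_window):
    """u(i) = 1/(n_window + 1) * sum of x[i+m] for m in -n_window/2 .. +n_window/2.

    Positions without a full window on both sides are returned as None.
    """
    if n_window % 2 != 0:
        raise ValueError("n_window is assumed to be even here")
    half = n_window // 2
    out = [None] * len(x)
    if len(x) < n_window + 1:
        return out
    # Sum of the first full window, centred on position `half`.
    running = sum(x[: n_window + 1])
    out[half] = running / (n_window + 1)
    # Slide by one base: add the entering value, drop the leaving one.
    for i in range(half + 1, len(x) - half):
        running += x[i + half] - x[i - half - 1]
        out[i] = running / (n_window + 1)
    return out


# Toy coverage values; hypothetical data, for illustration only.
coverage = [2, 4, 6, 8, 10, 12]
smoothed = centered_mean(coverage, n_window=2)
```

Updating a running sum instead of re-summing each window keeps the cost linear in the number of bases, which matters when every base gets its own window.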

So this deals only with the coverage issue (y-axis). You can probably do the same for GC content and plot the two against each other, for instance in a 3D graph over position i.

PS: sorry for the complex explanation. I simply cannot fit a proper Sigma function in here...

PPS: please note that for circular genomes you need to make sure the window wraps around to the other end whenever you are within 1/2 window size of the start or end!
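One way to handle the wrap-around is modular indexing, sketched below in plain Python. The name `circular_centered_mean` is made up for illustration, and the sketch assumes the window (2 * half + 1 positions) is no longer than the genome itself.

```python
def circular_centered_mean(x, n_window):
    """Centred window mean on a circular sequence.

    Indices past either end wrap around to the other end via modulo.
    """
    n = len(x)
    half = n_window // 2
    width = 2 * half + 1  # equals n_window + 1 when n_window is even
    if width > n:
        raise ValueError("window is longer than the sequence")
    out = []
    for i in range(n):
        # In Python, (i + m) % n is non-negative even when i + m < 0.
        total = sum(x[(i + m) % n] for m in range(-half, half + 1))
        out.append(total / width)
    return out


# Toy circular coverage track; hypothetical data, for illustration only.
circular_coverage = [0, 3, 6, 9]
wrapped = circular_centered_mean(circular_coverage, n_window=2)
```

This version re-sums every window for clarity; on a real genome a running sum, as used for linear sequences, would be the faster choice.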

modified 8.0 years ago • written 8.0 years ago by ALchEmiXt (1.9k)