Question: Find "mean" cluster positions after random sampling of genomic events
gravatar for peer_wuensche
2.9 years ago by
peer_wuensche10 wrote:

I have the following problem:

I have 1bp coordinates in the genome which represent vector integration sites. These integration sites (IS) form clusters, meaning they occur in some relatively small areas (1-10kb) in the genome while the vast majority remains untouched.

I have a function which predicts these clusters based on only two arguments:

  1. maximum distance between two IS to be counted into the same cluster
  2. minimum number of IS to be counted as cluster

I need to compare two different data sets, which differ in their number of IS (approx. 100,000 vs. 1,000,000). So far I tried to sample the bigger data set to match the number of IS of the smaller one, however, I need to do this multiple times to get a decent representation of the original data. This is where the problem is: How can I predict the clusters for each sampling and then find a "mean" cluster position or let's say get likelihood for a cluster to occur at that position.

The cluster position is a range (i.e. chr1:8492643-8498245) and i get roughly 5000 of such ranges across the genome for each iteration.

I looked into intersect from bedtools but it seems it needs some sort of reference file, which it compares the other genomic ranges to. But I would rather have to compare all samplings between all samplings, like a pairwise comparison

I also tried a R solution, however, when I aggregate using the exact cluster positions I get a size bias towards bigger clusters, as these probably occur more often when sampling a small proportion of the data.

Any hints on packages or tools I could use or did anyone ever have a similar problem and a solution?

Any help is highly appreciated!!

bedtools R genome • 905 views
ADD COMMENTlink modified 2.9 years ago by Biostar ♦♦ 20 • written 2.9 years ago by peer_wuensche10

It's not immediately clear what operations you are doing, but the following might help in that it describes a scalable approach to pairwise operations on an arbitrary number of inputs:

ADD REPLYlink written 2.9 years ago by Alex Reynolds30k

Thanks for pointing out this handy little tool! I checked it out and indeed it does help finding intersecting clusters in a pairwise comparison. It even lets you specify a minimum overlap. However, the output only tells you which cluster or genomic range in general interacts with what. It doesn't seem to have to option to generate new coordinated e.g. a new bed file with "common" or "mean" genomic ranges.

ADD REPLYlink written 2.9 years ago by peer_wuensche10

You could pipe the output of a bedmap --echo-map to a bedmap --merge operation to get a common genomic range for overlapping intervals within a pairwise combination.

I don't know what a "mean" genomic range is, however. Maybe you could investigate piping bedmap --echo-map-size to calc or similar, generate a mean mapped region size, and pipe that to awk or similar, along with a start position offset from the original reference BED element, which prints out a computed BED element.

I'd definitely read up on the --echo-map-* options as they might help you generate new elements with the features you want.

ADD REPLYlink written 2.9 years ago by Alex Reynolds30k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1477 users visited in the last hour