I have the following problem:
I have 1 bp coordinates in the genome that represent vector integration sites. These integration sites (IS) form clusters, meaning they occur in relatively small regions (1-10 kb) of the genome while the vast majority of the genome remains untouched.
I have a function which predicts these clusters based on only two arguments (a rough sketch of the logic follows below):
- the maximum distance between two IS for them to be counted into the same cluster
- the minimum number of IS required to call a cluster
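For reference, the clustering logic is conceptually equivalent to something like the following sketch using GenomicRanges (the function and argument names are just placeholders, and the exact off-by-one handling of `min.gapwidth` may differ from my real implementation):

```r
library(GenomicRanges)

call_clusters <- function(is_gr, max_dist, min_is) {
  # merge IS that lie within max_dist of each other into candidate clusters
  candidates <- reduce(is_gr, min.gapwidth = max_dist)
  # keep only candidates supported by at least min_is integration sites
  candidates[countOverlaps(candidates, is_gr) >= min_is]
}
```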
I need to compare two data sets which differ in their number of IS (approx. 100,000 vs. 1,000,000). So far I have tried to downsample the bigger data set to match the number of IS in the smaller one; however, I need to do this multiple times to get a decent representation of the original data. This is where the problem lies: how can I predict the clusters for each sampling and then find a "mean" cluster position, or rather a likelihood for a cluster to occur at a given position?
Each cluster position is a range (e.g. chr1:8492643-8498245), and I get roughly 5,000 such ranges across the genome per iteration.
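To make the goal more concrete: by "likelihood" I imagine something like a per-position frequency across samplings, roughly along these lines (only a sketch; `clusters` would be a list of GRanges, one element per sampling, all called against the same genome build):

```r
library(GenomicRanges)

cluster_frequency <- function(clusters) {
  # per-base count: in how many samplings is this position covered by a cluster?
  cov <- coverage(Reduce(c, clusters))
  # express as a fraction of samplings; return ranges with a score column
  freq <- as(cov / length(clusters), "GRanges")
  freq[freq$score > 0]
}
```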
I looked into `intersect` from bedtools, but it seems to require a reference file against which the other genomic ranges are compared. What I actually need is to compare all samplings against all other samplings, i.e. a pairwise comparison, for example like the sketch below.
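By pairwise comparison I mean roughly this (again only a sketch, with `clusters` being a list of per-sampling cluster GRanges as above):

```r
library(GenomicRanges)

pairwise_overlap <- function(clusters) {
  n <- length(clusters)
  m <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # fraction of clusters in sampling i overlapping any cluster in sampling j
      m[i, j] <- mean(overlapsAny(clusters[[i]], clusters[[j]]))
    }
  }
  m
}
```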
I also tried an R solution; however, when I aggregate using the exact cluster positions, I get a size bias towards bigger clusters, as these are probably recovered more often when only a small proportion of the data is sampled.
Any hints on packages or tools I could use? Or has anyone had a similar problem and found a solution?
Any help is highly appreciated!!