I have the following problem:
I have 1 bp coordinates in the genome that represent vector integration sites. These integration sites (IS) form clusters, meaning they occur in relatively small regions (1-10 kb) of the genome while the vast majority of the genome remains untouched.
I have a function which predicts these clusters based on only two arguments (a rough sketch of the logic follows below):
- the maximum distance between two IS for them to be counted into the same cluster
- the minimum number of IS required to call a cluster
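For reference, the clustering logic is conceptually equivalent to something like the following sketch using GenomicRanges (the function and argument names are just placeholders, and the exact off-by-one handling of `min.gapwidth` may differ from my real implementation):

```r
library(GenomicRanges)

call_clusters <- function(is_gr, max_dist, min_is) {
  # merge IS that lie within max_dist of each other into candidate clusters
  candidates <- reduce(is_gr, min.gapwidth = max_dist)
  # keep only candidates supported by at least min_is integration sites
  candidates[countOverlaps(candidates, is_gr) >= min_is]
}
```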
I need to compare two data sets which differ in their number of IS (approx. 100,000 vs. 1,000,000). So far I have tried to downsample the bigger data set to match the number of IS in the smaller one; however, I need to do this multiple times to get a decent representation of the original data. This is where the problem lies: how can I predict the clusters for each sampling and then find a "mean" cluster position, or rather a likelihood for a cluster to occur at a given position?
Each cluster position is a range (e.g. chr1:8492643-8498245), and I get roughly 5,000 such ranges across the genome per iteration.
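To make the goal more concrete: by "likelihood" I imagine something like a per-position frequency across samplings, roughly along these lines (only a sketch; `clusters` would be a list of GRanges, one element per sampling, all called against the same genome build):

```r
library(GenomicRanges)

cluster_frequency <- function(clusters) {
  # per-base count: in how many samplings is this position covered by a cluster?
  cov <- coverage(Reduce(c, clusters))
  # express as a fraction of samplings; return ranges with a score column
  freq <- as(cov / length(clusters), "GRanges")
  freq[freq$score > 0]
}
```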
I looked into `intersect` from bedtools, but it seems to require a reference file against which the other genomic ranges are compared. What I actually need is to compare all samplings against all other samplings, i.e. a pairwise comparison, for example like the sketch below.
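By pairwise comparison I mean roughly this (again only a sketch, with `clusters` being a list of per-sampling cluster GRanges as above):

```r
library(GenomicRanges)

pairwise_overlap <- function(clusters) {
  n <- length(clusters)
  m <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # fraction of clusters in sampling i overlapping any cluster in sampling j
      m[i, j] <- mean(overlapsAny(clusters[[i]], clusters[[j]]))
    }
  }
  m
}
```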
I also tried an R solution; however, when I aggregate using the exact cluster positions, I get a size bias towards bigger clusters, as these are probably recovered more often when only a small proportion of the data is sampled.
Any hints on packages or tools I could use? Or has anyone had a similar problem and found a solution?
Any help is highly appreciated!!