Question

using GRanges metadata to constrain overlap searches between objects

0

Entering edit mode

12 months ago

Charles Murtaugh ▴ 50

I have two sets of genomic data, one a list of SNPs and another a list of ChIP-seq-type peaks, each drawn from the same set of multiple samples. I'd like to be able to ask, for each sample, which SNPs overlap with the peaks in the same sample - and to do this fast enough that I can run a permutation test, randomizing the sample assignment of each SNP 100 or 1,000 times and counting the resulting overlaps, to get a statistical idea of which samples are enriched for SNP-peak overlaps. The data is organized as GRanges objects, and I'm using plyranges functions for analysis - but any other R package that could do it would be fine.

The problem is that none of the find_overlaps-style functions can use metadata (e.g. sample ID) as part of their search process, so if I have 100 samples, then each SNP in each sample is checked against the peaks from all 100 samples, which makes produces a huge output table that is clogged with irrelevant comparisons, plus takes a lot of time. My inelegant workaround, as shown in the dummy code below, is to filter the overlap table to keep only hits where the SNV and peak came from the same sample (sample_id.x==sample_id.y), and use that for counting.

The dummy code reproduces the issue: three samples, 100 SNPs each, randomly assigned from nt 1-2500 except for 25 SNPs of one sample (s03) that are scattered in a window around position 1000; and four 20 bp peaks, of which one, detected in s03, overlaps position 1000. On my laptop, it took 46 seconds to complete with only 100 permutations (note that the progress counter for the permutation loop adds essentially no time). It gets much, much longer if you scale up to 100+ samples with realistic numbers of SNPs and peaks! Any advice for a overlap-searching approach that can be constrained based on sample ID-type metadata would be appreciated.

Code:

library(tidyverse)
library(plyranges)

length <- 2500
ids <- c(rep('s01', 100), rep('s02', 100), rep('s03', 100))
snps <- GRanges(Rle('chr1'), IRanges(c(sample((1:length), 100, replace=T), 
                                       sample((1:length), 100, replace=T), 
                                       sample((980:1040), 25, replace=T), 
                                       sample((1:length), 75, replace=T)), width=1),
                sample_id=ids)
peaks <- GRanges(Rle('chr1'), IRanges(c(500, 1000, 1000, 2000), width=20),
                 sample_id=c('s01', 's01', 's03', 's02'))
ov <- find_overlaps(snps, peaks) %>% print()  # finds all overlaps, including between mismatched samples
ov <- filter(ov, sample_id.x==sample_id.y) %>% print()  # only matched samples

n_permut <- 100  # number of times to permute sample IDs

startTime <- Sys.time()
for(s in unique(snps$sample_id)) {
  print(s)
  sample_overlaps <- length(filter(ov, sample_id.x==s))
  permut_overlaps <- rep(0, n_permut)  # vector for counting overlaps of permuted SNPs
  for(i in 1:n_permut) {
    # progress counter
    if(i %% (n_permut/10) == 0) print(paste0((i/n_permut)*100, '% complete'))
    snps_permut <- mutate(snps, sample_id=sample(ids, 300, replace=F))
    ov_permut <- find_overlaps(snps_permut, peaks) %>% filter(sample_id.x==sample_id.y)
    permut_overlaps[i] <- length(filter(ov_permut, sample_id.x==s))
  }
  p_permut <- length(which(permut_overlaps >= sample_overlaps))/length(permut_overlaps)
  print(paste0('sample ', s, ': fraction of permuted SNP distributions with higher overlap = ', p_permut))
}
endTime <- Sys.time()
print(paste0('Time elapsed: ', endTime-startTime))

plyranges Bioconductor GRanges R GenomicRanges • 508 views

ADD COMMENT • link updated 12 months ago by seidel 11k • written 12 months ago by Charles Murtaugh ▴ 50

score 0 · Answer 1 · 2023-03-27

Rather than do all overlaps all the time for all samples, why not restrict the data by sample when you can. The code below works about 4 times faster (24 seconds vs. 2 minutes) than your version (hopefully it's doing the same thing).

Modifications: 1.) permute snp list, but keep only those with the current sample being tested 2.) count overlaps between permuted snps and peaks from the matching sample

One could also consider a method where you split the GRanges objects by sample, and loop through using the universe of available snps.

startTime <- Sys.time()
for(s in unique(snps$sample_id)) {
  print(s)
  permut_overlaps <- rep(0, n_permut)  # vector for counting overlaps of permuted SNPs  
  sample_overlaps <- length(filter(ov, sample_id.x==s))
  for(i in 1:n_permut) {
    # progress counter
    if(i %% (n_permut/10) == 0) print(paste0((i/n_permut)*100, '% complete'))
    # make a copy of snps so we can permute and filter
    snps_permut <- snps
    # permute snp ids
    snps_permut$sample_id <- sample(snps_permut$sample_id, 300, replace=F)
    # keep only those we're testing
    snps_permut <- snps_permut[snps_permut$sample_id == s]
    # sum overlaps with just our peaks of interest
    permut_overlaps[i] <- sum(count_overlaps(snps_permut, peaks[peaks$sample_id == s]))
  }
  p_permut <- length(which(permut_overlaps >= sample_overlaps))/length(permut_overlaps)
  print(paste0('sample ', s, ': fraction of permuted SNP distributions with higher overlap = ', p_permut))
}
endTime <- Sys.time()
print(paste0('Time elapsed: ', endTime-startTime))