It looks like a pretty straightforward scripting problem. You could efficiently resolve overlaps at the end with
bedops --partition, which, as far as I know, is not available in other toolkits.
Here's a script you can use to subsample from a shuffled list of intervals:
#!/usr/bin/env python

import sys
import random
import argparse

def main():
    parser = argparse.ArgumentParser(description='Subsample within intervals in BED file')
    parser.add_argument("-n", "--n", type=int, help='sample count')
    args = parser.parse_args()
    c = 0
    for line in sys.stdin:
        elems = line.strip().split('\t')
        start = int(elems[1])
        stop = int(elems[2])
        length = stop - start
        # leave room for a subinterval of at least length 1
        sample_start_offset = random.randint(0, length - 2)
        sample_length = random.randint(1, length - sample_start_offset - 1)
        elems[1] = str(start + sample_start_offset)
        elems[2] = str(start + sample_start_offset + sample_length)
        sys.stdout.write("%s\n" % ('\t'.join(elems)))
        c += 1
        if c == args.n:
            break

if __name__ == "__main__":
    main()
Here's how it could be used to sample 123 random subintervals:
$ subsample.py --n=123 < <(shuf intervals.bed) | sort-bed - | bedops --partition - | shuf -n 123 - | sort-bed - > answer.bed
How this works:
<(shuf intervals.bed) is a bash process substitution. You can use this to feed a shuffled list of intervals into
subsample.py (the above-listed Python script). You can use
shuf or any tool to shuffle lines of a file.
If your original intervals file is whole-genome scale and won't fit into the memory allotment for
shuf (i.e. you get out-of-memory errors using
shuf), I have a tool called sample that almost always gets around this limitation.
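The idea behind sampling lines without holding the whole file in memory can be sketched with classic reservoir sampling (a minimal illustration, not the actual sample tool; the function name is mine):

```python
import random

def reservoir_sample(lines, k):
    """Keep a uniform random sample of k lines using O(k) memory.

    Classic reservoir sampling: after the reservoir fills, each new
    line replaces a random slot with probability k / (lines seen).
    """
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir
```

Because it streams, something like reservoir_sample(open("intervals.bed"), 123) never needs more memory than the 123 kept lines.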
The Python script
subsample.py parses each randomly-selected interval for its start and stop position, using this to generate a subinterval with a randomly selected start position and length, chosen such that the subinterval will fall entirely within the parent interval.
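The core sampling step can be sketched in isolation (a minimal sketch with hypothetical coordinates; the offset bound leaves room for a subinterval of at least length 1):

```python
import random

def subsample_interval(start, stop):
    """Pick a random subinterval lying entirely within [start, stop)."""
    length = stop - start
    # offset can be at most length - 2, so at least 1 base remains
    offset = random.randint(0, length - 2)
    sub_length = random.randint(1, length - offset - 1)
    return (start + offset, start + offset + sub_length)

sub_start, sub_stop = subsample_interval(1000, 2000)
assert 1000 <= sub_start < sub_stop <= 2000
```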
The output of this script is BED-formatted, but it is unsorted. We pipe this to
sort-bed to sort it, so that downstream tools like
bedops can work with it.
It is possible that subintervals overlap. So the sorted BED is piped to
bedops --partition to make all elements disjoint.
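Conceptually, partitioning splits the input at every start and stop boundary and keeps each covered piece (a rough Python sketch of the idea for intervals on one chromosome, not bedops itself):

```python
def partition(intervals):
    """Split (start, stop) pairs at every boundary so pieces are disjoint."""
    # collect every distinct boundary, then emit each gap between
    # consecutive boundaries that is covered by some input interval
    bounds = sorted({x for start, stop in intervals for x in (start, stop)})
    pieces = []
    for left, right in zip(bounds, bounds[1:]):
        if any(start <= left and right <= stop for start, stop in intervals):
            pieces.append((left, right))
    return pieces

print(partition([(10, 50), (30, 70)]))
# → [(10, 30), (30, 50), (50, 70)]
```

Two overlapping inputs become three disjoint pieces, which is why the element count can grow past N.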
If we split any subintervals, we'll end up with more subintervals than we originally asked for. So we pipe disjoint elements to
shuf -n to grab N random subintervals of interest, and pipe that again to
sort-bed to get sorted BED output. (We could use
head -n, but on reflection, I think that could introduce bias in the sample.)
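In Python terms, the difference between head -n and shuf -n looks roughly like this (hypothetical list of disjoint pieces; random.sample draws uniformly without replacement, while slicing always favors the earliest, lowest-coordinate elements):

```python
import random

pieces = [("chr1", 100, 200), ("chr1", 300, 400),
          ("chr2", 50, 150), ("chr2", 500, 600)]

biased = pieces[:2]                  # like head -n 2: always the first two
unbiased = random.sample(pieces, 2)  # like shuf -n 2: any two, equal chance
```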
It looks a bit involved, but I suspect this will run a lot faster, especially on large inputs and sample sizes.