Question: how to randomly select intervals of different lengths (not records) from a BED file
0
gravatar for jing.mengrabbit
2.0 years ago by
jing.mengrabbit30 wrote:

Hi, I have a BED file like the following: chr1 6 12 chr1 13 28 chr1 32 74 chr1 101 151 chr2 2 17 chr3 23 90 chr8 97 123

I need to randomly generate several intervals of different lengths (say length 5, 4 and 3)and the intervals do not overlap with each other, and the output is like this: chr1 7 11 chr1 13 18 chr1 21 26 chr3 25 28 chr3 80 83

A method is to first create all possible intervals of length L by bedtools makewindow -w L -s 1 -b input.bed, and then pipe the output to "bedtools sample". The third step is to use bedtools subtract to remove the sampled intervals, and repeat step 1 and 2 for intervals of length L1.......... However, in the first step, bedtools output intervals whose lengths are not L as well, and the bedtools sample may output overlapping intervals. These unwanted lines can be removed by extra commands, but it takes more time especially for large BED files.

If you know quicker ways to finish what I need, could you please share with me? Thanks very much for your time!

sequence genome • 1.3k views
ADD COMMENTlink modified 2.0 years ago by bernatgel1.9k • written 2.0 years ago by jing.mengrabbit30
3
gravatar for bernatgel
2.0 years ago by
bernatgel1.9k
Barcelona, Spain
bernatgel1.9k wrote:

If you can use R, you could use regioneR's createRandomRegions function to do that.

The trick here is to create a mask by subtracting your regions from the whole genome. With that mask the function will select random regions in the only unmasked parts of the genome it can use, the original regions in your bed file.

library(regioneR)

original <- read.table(file = "original_intervals.bed", sep="\t", stringsAsFactors = FALSE)

genome <- filterChromosomes(getGenome("hg19"))
mask <- subtractRegions(genome, original)

region.length <- 4
num.regions <- 5
new.regions <- createRandomRegions(nregions = num.regions, length.mean = region.length,
                                   length.sd = 0, genome = genome, mask = mask,
                                   non.overlapping = TRUE)

And you'll get a GRanges with your random regions.

new.regions

GRanges object with 5 ranges and 0 metadata columns:
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]     chr3 [ 48,  51]      *
  [2]     chr1 [ 14,  17]      *
  [3]     chr3 [ 64,  67]      *
  [4]     chr8 [109, 112]      *
  [5]     chr1 [ 67,  70]      *
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

If you need to save it to a bed file, you'll need to transform it into a data.frame first

write.table(toDataframe(new.regions), file = "random.bed", sep="\t", col.names = FALSE, row.names = FALSE, quote=FALSE)

Hope this helps :)

ADD COMMENTlink written 2.0 years ago by bernatgel1.9k

Thanks for providing another solution. I am going to look into it and see if there are other problems coming up.

ADD REPLYlink written 2.0 years ago by jing.mengrabbit30
2
gravatar for Alex Reynolds
2.0 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

It looks like a pretty straightforward scripting problem. You could efficiently solve overlaps at the end with bedops --partition, which is not available in other toolkits, as far as I know.

Here's a script you can use to subsample from a shuffled list of intervals:

#!/usr/bin/env python

import sys
import random
import argparse

def main():
    parser = argparse.ArgumentParser(description='Subsample within intervals in BED file')
    parser.add_argument("-n", "--n", type=int, help='sample count')
    args = parser.parse_args()

    c = 0

    for line in sys.stdin:
        elems = line.strip().split('\t')
        start = int(elems[1])
        stop = int(elems[2])
        length = stop - start
        sample_start_offset = random.randint(0, length)
        sample_length = random.randint(1, length - sample_start_offset - 1)
        elems[1] = str(start + sample_start_offset)
        elems[2] = str(start + sample_start_offset + sample_length)
        sys.stdout.write("%s\n" % ('\t'.join(elems)))
        c += 1
        if c == args.n:
            break

if __name__ == "__main__":
    main()

Here's how it could be used to sample 123 random subintervals:

$ subsample.py --n=123 < <(shuf intervals.bed) | sort-bed - | bedops --partition - | shuf -n 123 - | sort-bed - > answer.bed

How this works:

The command <(shuf intervals.bed) is a bash process substitution. You can use this to feed a shuffled list of intervals into subsample.py (the above-listed Python script). You can use shuf or any tool to shuffle lines of a file.

If your original intervals file is whole-genome scale and won't fit into the memory allotment for shuf (i.e. you get out-of-memory errors using shuf), I have a tool called sample that almost always gets around this limitation.

The Python script subsample.py parses each randomly-selected interval for its start and stop position, using this to generate a subinterval with a randomly selected start position and length, chosen such that the subinterval will fall entirely within the parent interval.

The output of this script is BED-formatted, but it is unsorted. We pipe this to sort-bed to sort it, so that downstream tools like bedops can work with it.

It is possible that subintervals overlap. So the sorted BED is piped to bedops --partition to make all elements disjoint.

If we split any subintervals, we'll end up with more subintervals than we originally asked for. So we pipe disjoint elements to shuf -n to grab N random subintervals of interest, and pipe that again to sort-bed to get sorted BED output. (We could use head -n, but on reflection, I think that could introduce bias in the sample.)

It looks a bit involved, but I suspect this will run a lot faster, especially on large inputs and sample sizes.

ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by Alex Reynolds28k

Thanks for your help. I will look into this.

ADD REPLYlink written 2.0 years ago by jing.mengrabbit30
0
gravatar for Aaronquinlan
2.0 years ago by
Aaronquinlan11k
United States
Aaronquinlan11k wrote:

The problem is that when your input BED record is not evenly divisible by L, you will have one final interval whose length is L. Just use awk to filter those out. Let's pretend we want 5 randomly chose intervals of length 10:

> bedtools makewindows -b set.bed -w 10 \
     | awk '($3-$2) == 10' \
     | bedtools sample -i - -n 5
ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by Aaronquinlan11k

Thanks for your reply. You are right, and there will be unwanted lines unless the record is evenly divisible. But still, there may be overlapping random samples from 'bedtools sample' if the slide window is set to 1 in 'makewindows'.

ADD REPLYlink written 2.0 years ago by jing.mengrabbit30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 656 users visited in the last hour