Question: Pick random position from specific genomic regions
2
gravatar for Nicolas Rosewick
4.8 years ago by
Belgium, Brussels
Nicolas Rosewick7.3k wrote:

Hi,

I've a bed file containing several region of interest and I want to pick randomly N positions in these regions. I know that "bedtools random" can pick random positions into the genome, but I don't find any tool to perform that on specifi region described in a bed file. 

 

Thanks

random position bed • 3.9k views
ADD COMMENTlink modified 4.8 years ago by Ryan Dale4.8k • written 4.8 years ago by Nicolas Rosewick7.3k
1

How does your bed file differ from the genome file that is required for bedtools random ?

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by PoGibas4.7k

bedtools random takes chromosizes in input. <chromName><TAB><chromSize>

ADD REPLYlink written 4.8 years ago by Nicolas Rosewick7.3k

Is it required that the tool be pre-made or are you happy with a small R or python script? Also, when you say "pick N positions", do you mean from each of the regions or from all of the regions combined (I assume the latter, but this is ambiguous).

ADD REPLYlink written 4.8 years ago by Devon Ryan88k

From all the regions combined.

I just wonder if such a tool exists. I can indeed wrote a small script using bedtools random and check if the positions are in the regions in the bed file.

ADD REPLYlink written 4.8 years ago by Nicolas Rosewick7.3k
4
gravatar for Alex Reynolds
4.8 years ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

If you want to sample N elements without replacement from your set of input BED elements, replace N with your value:

$ shuf input.bed | head -N | sort-bed - > answer.bed

Or you can use my sample program on Github: https://github.com/alexpreynolds/sample

$ sample -k N -p input.bed > answer.bed

This also offers sampling with replacement and some other features that might be useful, depending on the experiment.

If you want to sample N elements without replacement from a population of individual bases made up from your set of input BED elements, use the bedops --chop 1 command to split your input BED file into individual bases, then do the usual shuf or sample operation:

$ bedops --chop 1 input.bed > perBaseInput.bed
$ sample -k N -p perBaseInput.bed > perBaseAnswer.bed

You could pipe the per-base output from bedops into shuf directly, but if you want to draw multiple samples, it can be a waste of time to regenerate the per-base data each time. Saving to a separate file makes for easy re-use (re-sampling).

ADD COMMENTlink modified 2.5 years ago • written 4.8 years ago by Alex Reynolds27k

I think Nicolas Rosewick is looking for single positions from within the regions, not just random regions.

ADD REPLYlink written 4.8 years ago by Devon Ryan88k

I have added code for dealing with the second case via the command-line.

ADD REPLYlink written 4.8 years ago by Alex Reynolds27k

Thanks Alex. That's perfect. I adapted a little bit your code in a bash script :

To call : ./getNRandPosFromBed.sh input.bed 10 output.bed

 

#!/bin/bash

#
# getNRandPosFromBed.sh
# get N random position from regions specified in a bed file

# params 
# $1 = in.bed
# $2 = N : number of positions to output
# $3 = output.bed

TFILE="/tmp/$(basename $0).$$.tmp"

awk '\
    BEGIN { \
        OFS = "\t"; \
    } \
    { \
        chr = $1; \
        start = $2; \
        stop = $3; \
        if (NF > 3) { \
            remainder = $4; \
            for (rIdx = 5; rIdx <= NF; rIdx++) { \
                remainder = remainder"\t"$rIdx; \
            } \
        } \
        for (sIdx = start; sIdx < stop; sIdx++) { \
            if (NF > 3) { \
                print chr, sIdx, (sIdx + 1), remainder; \
            } \
            else { \
                print chr, sIdx, (sIdx + 1); \
            } \
        } \
    }' $1 > $TFILE

shuf $TFILE | head -n $2 | awk '{print $1"\t"$2}' | sort -k1,1V -k2,2n > $3

rm $TFILE

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by Nicolas Rosewick7.3k
1

Thanks for the feedback. A couple thoughts:

1. If you plan on taking more than one sample from the split regions file, consider saving it outside the /tmp directory to save time from having to recreate it each time. Compress it, if it takes up a lot of space.

2. I have done some performance testing on a few "typical" Linux and OS X hosts, and GNU sort can be about 2-3 times as slow as BEDOPS sort-bed at sorting BED files. For large files or when doing lots of sorting, this wasted time can add up.

If your sample is large, consider using sort-bed directly, or if you are doing a non-typical sort, add --parallel=n to the GNU sort call if you are sorting on a multiprocessor machine.

Even with --parallel, BEDOPS sort-bed usually runs faster than GNU sort. But if you have to use it, use --parallel where you can. (Disclaimer: I am one of the authors of BEDOPS, so take my claims with the necessary skeptical grains of salt and do testing on your end.)

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by Alex Reynolds27k

In fact, sorting is not essential for me. I just want to pick random positions to perform a statistical test. But thanks for the information about GNU sort and sort-bed from bedops. 

ADD REPLYlink written 4.8 years ago by Nicolas Rosewick7.3k
5
gravatar for Devon Ryan
4.8 years ago by
Devon Ryan88k
Freiburg, Germany
Devon Ryan88k wrote:

If no one posts a premade tool, then the following Rscript will work from the command line (usage script.R file.bed N, where N is the desired number of regions). Note that positions won't occur more than once (this can be trivially changed).

 

#!/usr/bin/env Rscript
suppressPackageStartupMessages(require(GenomicRanges))
suppressPackageStartupMessages(require(rtracklayer))

args <- commandArgs(T)
if(length(args) != 2) {
    print(args)
    stop("Not enough arguments. Usage: script.R foo.bed N, where N is an integer")
}

b <- import.bed(args[1])
l <- sum(width(b))
positions <- sample(c(1:l), args[2])
new_b <- GRanges(seqnames=as.character(rep(seqnames(b), width(b))),
    ranges=IRanges(start=unlist(mapply(seq, from=start(b), to=end(b))), width=1))
export.bed(new_b[positions], "output.bed")

 

This will work OK for bed files that aren't terribly large (such that "new_b" doesn't get enormous).

ADD COMMENTlink written 4.8 years ago by Devon Ryan88k

Thanks Devon. I'll try it now and give you a feedback.

ADD REPLYlink written 4.8 years ago by Nicolas Rosewick7.3k

The problem is that my bed file is quite big ( cpg island for hg19 :p )

ADD REPLYlink written 4.8 years ago by Nicolas Rosewick7.3k

That should still be OK unless you're pretty memory limited. The limit with this method is that the new_b object can get quite large and if you only a couple gigs of memory then that'll cause problems (I actually don't know how much space a GRanges object occupies by row). There are alternative implementations of the penultimate line if the memory usage becomes an issue.

ADD REPLYlink written 4.8 years ago by Devon Ryan88k
0
gravatar for Ryan Dale
4.8 years ago by
Ryan Dale4.8k
Bethesda, MD
Ryan Dale4.8k wrote:

Obligatory BEDTools solution: First make a file containing features of the size you'd like to use, and then shuffle them only within the regions of interest.  Specifically:

1. Make a file with N features of the size you want, here, 50 features of 1 bp each. The chromosome, start, stop don't matter since you'll be shuffling them anyway:

    for i in {1..50}; do echo -e "chr1\t1\t2" >> a.bed; done

2. Then shuffle them.  Using -incl will only shuffle them within myregions.bed

    bedtools shuffle -i a.bed -incl myregions.bed -g hg19

Note the genomes file is only used here by shuffle in case myregions.bed has regions outside the chromosome boundaries.

ADD COMMENTlink written 4.8 years ago by Ryan Dale4.8k

Dear all,

I want a slight modification to what Nicolas Rosewickasked in the beginning, so asked a question on existing thread instead of new one.

I have a region of my interest "targets.bed". Now i want to find equal number and same length (6-8nt) of random sequences from the same 3'utr. Can you suggest on how to get this on what i have been trying.

 

shuffleBed -chrom -excl targets.bed -i utr3p -g assembly.size

This provides me the remaining part of 3putr which do not overlap targets.bed. But it has two limitations

1) Size of random region is not in the range of targets.bed

2) Number of random region generated is not equal to count of targets.bed but equal to number of utr3p.

I also tried to input shuffleBED output to randomBED, but was useful. Any hep will be highly appreciated.

cheers

Chirag

 

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Chirag Nepal2.2k

1) and 2) are expected given the command line you provided.  

So instead of -excl targets.bed, try -incl targets.bed.

 

ADD REPLYlink written 4.6 years ago by Ryan Dale4.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1598 users visited in the last hour