Question: How To Randomly Sample A Subset Of Lines From A Bed File
bede.portz470 wrote, 5.1 years ago (United States):

How do I randomly sample lines from a bed file? Specifically, I want to make a smaller bed file of ChIP-seq reads from a larger one, while maintaining the relative proportion of lines from each chromosome.

Thanks,

Bede

Tags: bed • chip-seq
modified 4.4 years ago by monique.sakalidis0 • written 5.1 years ago by bede.portz470
Alex Reynolds26k wrote, 5.1 years ago (Seattle, WA USA):

To preserve the relative proportion of elements per chromosome, I think you will need to do a bit more work.

Overall procedure:

  1. Split the BED input into per-chromosome files
  2. Count the number of lines in each per-chromosome file
  3. Calculate the proportion of lines in each per-chromosome file, based on the total number of lines in the input BED
  4. Sample from each per-chromosome file separately, based on these proportions

Using BEDOPS sort-bed, first sort your BED data:

$ sort-bed elements.unsorted.bed > elements.bed

Split it with bedextract and count lines. The bedextract tool splits BED files into per-chromosome files very efficiently, and we write the per-chromosome line counts to a text file called counts.txt:

$ for chr in `bedextract --list-chr elements.bed`; do \
    bedextract $chr elements.bed > elements.$chr.bed; \
    count=`wc -l < elements.$chr.bed`; \
    echo -e "$chr\t$count"; \
done > counts.txt

Let's assume that we ultimately want to sample a total of 1000 elements from the input BED file. You can change this value to whatever you need.

We next calculate a table of per-chromosome proportions and how many elements you will pull from each chromosome:

$ sampleSize=1000
$ sum=`cut -f2 counts.txt | perl -nle '$sum += $_; END { print $sum }'`
$ awk -v sum=$sum -v sampleSize=$sampleSize '{print $0"\t"($2/sum)"\t"int(sampleSize*$2/sum)}' counts.txt > proportions.txt
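As a sanity check, here is the same calculation run on a toy counts file (the chromosome counts below are made up for illustration):

```shell
# Toy counts file: chromosome <tab> element count (illustrative values)
printf "chr1\t600\nchr2\t300\nchr3\t100\n" > counts.txt

sampleSize=1000
sum=$(awk '{ s += $2 } END { print s }' counts.txt)   # 1000 elements total

# Append per-chromosome proportion and per-chromosome sample size
awk -v sum="$sum" -v sampleSize="$sampleSize" \
    '{ print $0"\t"($2/sum)"\t"int(sampleSize*$2/sum) }' counts.txt
```

Here chr1 contributes 60% of the input, so 600 of the 1000 sampled elements would be drawn from it.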

Finally, do a random sample from each of the per-chromosome BED files, choosing a sample size for each chromosome that reflects its proportion in the larger input:

$ awk '{
    perChrName = $1;
    perChrN = $4;
    cmd = "shuf elements."perChrName".bed | head -n "perChrN;
    system(cmd);
}' proportions.txt \
| sort-bed - \
> sample.bed

The output is a sort-bed-sorted BED file called sample.bed. This file contains roughly 1000 rows (this will be close, but not exact, depending on the cutoff values from the int() call in the awk statement where we calculate proportions). The number of samples from each chromosome will therefore roughly reflect the distribution of elements in the larger input BED file.

modified 5.1 years ago • written 5.1 years ago by Alex Reynolds26k

This is a more thorough solution that seemingly ensures I equally represent each chromosome. However, it seems to me that if shuf is completely random, then data from each chromosome will be distributed randomly through the resulting file, so taking head -n [some number] should give me an approximately representative sample?

written 5.1 years ago by bede.portz470

Being completely random and maintaining your proportions are incompatible goals. After all, a single random sampling is just as likely to return entirely reads from chromosome 4 as it is to be distributed across all chromosomes.

written 5.1 years ago by Chris Miller20k

Point taken, I will try to be more precise, and less colloquial, when seeking help in the future.

written 5.1 years ago by bede.portz470

But this is not a single random sampling. If the OP wants to subset 10,000 lines then he is performing 10,000 random samplings out of the original population without replacement. Assuming a good distribution of reads across chromosomes in the original population (BED file) the probability of sampling 10,000 times and only collecting reads from chromosome 4 is vanishingly small.
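For a rough sense of how small: suppose (hypothetically) chr4 held 5% of all reads. The base-10 log of the probability that 10,000 independent draws all land on chr4 is quick to estimate with awk (treating the draws as sampling with replacement, for simplicity):

```shell
# log10 P(all 10,000 draws hit chr4) = 10000 * log10(0.05)
awk 'BEGIN { printf "%.0f\n", 10000 * log(0.05) / log(10) }'
# -13010
```

That is on the order of 10^-13010, i.e., vanishingly small indeed.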

written 5.1 years ago by kmcarr00270

You're not wrong, but if you really want to maintain the proportion with high fidelity, then you have to randomly sample an appropriate number from each chromosome. Whether random sampling from a shuffled file is 'good enough' will depend on the specific application.

written 5.1 years ago by Chris Miller20k

I attempted to use this particular approach but for some reason only pulled intervals from chr1-3 and X, Y when doing this, while the bed file contains all of the chromosomes. Any reason you can see as to why this might occur? Thanks!

written 28 days ago by rbronste200

Not sure. Is everything sorted?

written 28 days ago by Alex Reynolds26k

Actually, I have figured that out and applied it successfully, except for one thing: I input a list of about 800K peaks and am looking for the 1000 that's outlined in your code, but I get about 15K. Here is how I ran it:

for chr in `bedextract --list-chr BNST_merged500kPeaks_allEB.bed`; do \
    bedextract $chr BNST_merged500kPeaks_allEB.bed > BNST_merged500kPeaks_allEB.$chr.bed; \
    count=`wc -l BNST_merged500kPeaks_allEB.$chr.bed | cut -d' ' -f1`; \
    echo -e "$chr\t$count"; done \ 
> BNST_merged500kPeaks_allEB_counts.txt

sampleSize=1000
sum=`cut -f2 counts.txt | perl -nle '$sum += $_ } END { print $sum' -`
awk -v sum=$sum -v sampleSize=$sampleSize '{print $0"\t"($2/sum)"\t"int(sampleSize*$2/sum)}' BNST_merged500kPeaks_allEB_counts.txt > BNST_merged500kPeaks_allEB_proportions.txt

awk '{ \
    perChrName=$1; \
    perChrN=$4; \
    cmd="shuf BNST_merged500kPeaks_allEB."perChrName".bed | head -n "perChrN; \
    system(cmd); \
}' BNST_merged500kPeaks_allEB_proportions.txt \
| sort-bed - \
> BNST_merged500kPeaks_allEB_downsample.bed

I'm wondering if the following entries in the counts file might have a bearing on this. Is there a way to remove the Un and random scaffolds at this stage of the analysis and focus only on the main mouse chromosomes?

chr1_GL456210_random    1
chr1_GL456211_random    5
chr1_GL456212_random    3
chr1_GL456221_random    1
chr2    59105
chr3    49119
chr4    47234
chr4_GL456216_random    14
chr4_JH584295_random    1
chr5    47250
chr5_JH584299_random    3
chr6    46638
chr7    40543
chr8    42224
chr9    41203
chrUn_GL456239  11
chrUn_GL456359  5
chrUn_GL456360  1
chrUn_GL456366  8
modified 28 days ago • written 28 days ago by rbronste200

You could use grep to remove the unplaced and random scaffolds:

$ grep -v -e "^chr.*_.*" in.txt > out.txt

This is GNU grep so make changes as needed for OS X etc.
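A throwaway demo on a few of the rows from the counts file above, showing that only the scaffold entries are dropped:

```shell
# Entries with an underscore in the chromosome name are scaffolds
printf "chr1_GL456210_random\t1\nchr2\t59105\nchrUn_GL456239\t11\n" > counts_demo.txt
grep -v -e "^chr.*_.*" counts_demo.txt
# only the "chr2  59105" row survives
```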

modified 28 days ago • written 28 days ago by Alex Reynolds26k

Thanks! One final thing on this: I'm wondering how to subsample intervals of similar lengths, let's say an average of 350 bp?

written 11 days ago by rbronste200

Just spitballing:

You could use a normal distribution model in R to generate a pool of interval lengths with a mean length of 350 bases and an SD of 50:

lengths <- rnorm(n=1234, mean=350, sd=50)

Or you could use whatever distribution that best models your random intervals.

Generate a pool of start positions for a given chromosome (say, chr1 from hg38):

starts <- floor(runif(n=1234, min=0, max=(248956422 - 350 - 1)))

Subtracting 350 from the max ensures that a random element does not extend past the edge of chr1 (bounds). Subtracting another 1 ensures the interval is zero-indexed, half-open. The floor() call keeps the start coordinates integral, as BED requires.

For each start position, you add the random length, converted to an integer value:

stops <- starts + as.integer(lengths)

Then make a vector of chromosomes:

chrs <- rep('chr1', 1234)

And bundle the sample into a data frame:

df <- data.frame(chrs, starts, stops)

From there you could export this to a text file or GenomicRanges object (maybe?).

This would need a little bit of work to package into a function for each chromosome and its assembly's bounds, but hopefully this gives an idea of what you could perhaps do.
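If you'd rather stay on the command line, here is a rough awk analogue of the R sketch above. A uniform ±50 bp jitter stands in for the normal draw, and the chr1 length is the hg38 value used above; adjust both to taste:

```shell
# 1000 random chr1 intervals, lengths uniform in 300..400 bp, kept in bounds
awk -v n=1000 -v chrLen=248956422 -v mean=350 -v jitter=50 'BEGIN {
    srand();
    for (i = 0; i < n; i++) {
        len = mean - jitter + int(rand() * (2 * jitter + 1));  # 300..400 bp
        start = int(rand() * (chrLen - len));                  # stop <= chrLen
        print "chr1\t" start "\t" start + len;
    }
}' > random.chr1.bed
```

Pipe the result through sort-bed if downstream BEDOPS tools need sorted input.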

modified 11 days ago • written 11 days ago by Alex Reynolds26k

I guess I can also take the aggregate bed files representing all chromosomes before downsampling and simply subsample intervals of 350bp in length and thereafter use your approach at the start of this thread to subsample those evenly across all chromosomes? Is there an easy way of taking just a BED file and pulling out intervals that average 350bp in length?

modified 11 days ago • written 11 days ago by rbronste200

The bedmap tool has options to report the sizes of reference and/or mapped elements, which you could use with awk to report elements of length 350:

$ bedmap --echo --echo-ref-size --delim '\t' elements.bed | awk '($NF == 350)' > subsample.bed

Or elements within some CI around 350:

$ bedmap --echo --echo-ref-size --delim '\t' elements.bed | awk '(($NF > 300) && ($NF < 400))' > subsample.bed

Etc.
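If BEDOPS isn't installed, plain awk can compute the same length directly from the coordinate columns, since a BED interval's length is just stop minus start. A sketch of the same window filter:

```shell
# Keep elements whose length falls in the 300-400 bp window
awk '(($3 - $2) > 300 && ($3 - $2) < 400)' elements.bed > subsample.bed
```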

written 11 days ago by Alex Reynolds26k

Perfect, thanks! Second option is what I am looking for in this instance.

written 11 days ago by rbronste200

One other quick thing: is it possible in this scenario to match the subsampled ~350 bp peaks in their read counts to the target peak set? I'm basically working with about 10K target peaks at varying read depth, and would like to take the downsampled 10K peaks across all chromosomes that I have at 350 bp and match their read depth to the target. Thanks in advance for any advice!

written 1 day ago by rbronste200

Rejection sampling might work. You would generate an element of desired average length, measure reads, and either keep or discard it, depending on whether that measurement falls within the bounds of the read statistics for your target peak set. Keep sampling until you have as many criteria-matching elements as you need for your experiment. https://en.m.wikipedia.org/wiki/Rejection_sampling

modified 1 day ago • written 1 day ago by Alex Reynolds26k

Do you know of a package or tool that assists in rejection sampling or would you script it out yourself? Thanks!

written 1 day ago by rbronste200

I would just script it. It's just the equivalent of a do...while loop, taking samples that meet your criteria and ignoring the rest, until you hit some predefined limit:

validSamples = [] // empty list of valid samples
validSampleCount = 0 // number of valid samples collected so far
N = 1234 // desired number of samples

// draw samples until you get enough that meet criteria
do {
    sample = drawSample()
    if sampleMeetsCriteria(sample) {
        addSampleToListOfValidSamples(sample)
        validSampleCount += 1
    }
    if validSampleCount >= N {
        break
    }
} while(True)

// do stuff with list of valid samples...

You basically need to write a sampleMeetsCriteria function. Here, that means an element having a certain number of reads that are in line with what is in your target peaks, perhaps.
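A minimal bash rendering of that loop, assuming (purely for illustration) that the criterion is a 300-400 bp length; candidates.bed and the length test are stand-ins for your own peak file and read-depth check:

```shell
#!/bin/bash
# Rejection sampling: draw (with replacement) until N draws meet the criterion
N=5
kept=0
: > valid.bed
while [ "$kept" -lt "$N" ]; do
    line=$(shuf -n 1 candidates.bed)                       # drawSample()
    len=$(printf '%s\n' "$line" | awk '{ print $3 - $2 }')
    if [ "$len" -ge 300 ] && [ "$len" -le 400 ]; then      # sampleMeetsCriteria()
        printf '%s\n' "$line" >> valid.bed
        kept=$((kept + 1))
    fi
done
```

Note this samples with replacement; deduplicate, or delete accepted lines from the pool, if each element may be used only once.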

modified 1 day ago • written 1 day ago by Alex Reynolds26k

Gabriel R.2.5k wrote, 5.1 years ago (Center for Geogenetik, Københavns Universitet):

cat input.bed | shuf | head -n [number of lines you want]

written 5.1 years ago by Gabriel R.2.5k

This won't maintain the chromosome proportions (the same relative number of lines from each chromosome, shuffled or not).

That would require sort -u with cut on the first column to fetch the unique chromosomes, then a combination of grep and head for each chromosome to generate the output file, which could then be randomized using shuf. There might be a better solution, as the file can be huge (> 50 million lines).

written 5.1 years ago by Sukhdeep Singh9.5k

Based on the comments, it seems that a simple solution is:

$ cat input.bed | shuf > shufinput.bed
$ cat shufinput.bed | head -n [number of lines] > randomsubset.bed

This solution should maintain the sorted order of the lines in input.bed, while also generating a random subset of data that should equally represent each chromosome in randomsubset.bed.

However, when I tried this on a test set of data, the ratios weren't maintained.

written 5.1 years ago by bede.portz470

Based on how your question is specifically worded, I still don't think this is what you want, but if this is what you want, Irsan is correct that this is a useless use of cat. Just operate on the files directly and pipe the results from one command to the next:

$ shuf input.bed | head -n 1234 > sampleFromEntireSet.bed

As you note, this will not preserve the relative proportions of elements, just like permuting a deck of sorted playing cards will destroy any ordering of suits and ranks. If you want to preserve ratios (i.e., get a randomly sampled subset of ranks for a given suit) then you'll need to permute subsets (i.e., first split the deck of cards by suit, before sampling for a subset of ranks within a suit).

modified 5.1 years ago • written 5.1 years ago by Alex Reynolds26k

useless use of cat

modified 5.1 years ago • written 5.1 years ago by Irsan6.8k

Thanks, SamuelL

written 5.1 years ago by bede.portz470

Chris Miller20k wrote, 5.1 years ago (Washington University in St. Louis, MO):

You could also just grab every nth line (which is less random, but will maintain your proportions).

#grabs every 100th line (GNU sed; lines 100, 200, 300, ...)
sed -n '0~100p' input.bed > output.bed
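A quick way to convince yourself of the stride (GNU sed; the first line printed is line 100, and lines 1-99 never appear):

```shell
seq 1000 | sed -n '0~100p'
# prints 100, 200, ..., 1000 (ten lines)
```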
modified 5.1 years ago • written 5.1 years ago by Chris Miller20k

Sukhdeep Singh9.5k wrote, 5.1 years ago (Netherlands):

I wanted a one-liner for this tricky little problem, so I consulted Stack Overflow, resulting in:

Try using awk (gawk, since asort is a gawk extension):

#!/bin/bash
numRow=3
awk -v numRow="$numRow" 'n[$1] < numRow { a[$1] = a[$1]"\n"$0; n[$1]++ } END { asort(a, b); for (x in b) print b[x] }' file

Output:

chr1    3003204 3003454 *   37  +
chr1    3003235 3003485 *   37  +
chr1    3003148 3003152 *   37  -

chr11   71863609    71863647    *   37  +
chr11   71864025    71864275    *   37  +
chr11   71864058    71864308    *   37  -

chrY    90828920    90829170    *   23  -
chrY    90829096    90829346    *   23  +
chrY    90828924    90829174    *   23  -

If a randomized order of lines is required inside the groups, just add a little shuf up front:

#!/bin/bash
numRow=3
shuf file | awk -v numRow="$numRow" 'n[$1] < numRow { a[$1] = a[$1]"\n"$0; n[$1]++ } END { asort(a, b); for (x in b) print b[x] }'

A very fast and efficient solution. Time taken for a file of 10 million reads:

real    0m2.923s
user    0m2.859s
sys    0m0.063s

Ref : http://stackoverflow.com/questions/19690954/shuffling-a-large-text-file-without-with-group-order-maintained

Credits : jkshah

Cheers

written 5.1 years ago by Sukhdeep Singh9.5k

Alex Reynolds26k wrote, 4.5 years ago (Seattle, WA USA):

I wrote a tool called sample (GitHub site) that uses reservoir sampling to pull an ordered or randomly-shuffled, uniformly random sample from a newline-delimited input text file. It can sample with or without replacement.

When sampling without replacement, this application offers a few advantages over common shuf-based approaches:

  • It performs roughly 2.25-2.75x faster than shuf in informal tests on OS X and Linux hosts.
  • It uses much less memory than the usual reservoir sampling approach that stores a pool of sampled elements; instead, sample stores the start positions of sampled lines (roughly 8 bytes per line).
  • Using less memory also gives sample two advantages:
  1. It avoids shuf: memory exhausted errors for whole-genome-scale input files.
  2. It allows a larger pool of samples before running out of system memory. For instance, a 2 GB allocation would allow a sample size of up to ~268M random elements.

The sample tool stores a pool of line positions and makes two passes through the input file. One pass generates the sample of random positions, while the second pass uses those positions to print the sample to standard output. To minimize the expense of this second pass, we use mmap routines to gain random access to data in the (regular) input file on both passes.

The benefit that mmap provided was significant. For comparison purposes, we also added a --cstdio option to test the performance of standard C I/O routines (fseek(), etc.); predictably, this performed worse than the mmap-based approach in all tests, but timing results were about identical with gshuf on OS X, and still an average 1.5x improvement over shuf under Linux.

The sample tool can be used to sample from any text file delimited by single newline characters (BED, SAM, VCF, etc.). Also, using the --lines-per-offset option allows sampling and shuffling repeated multiples of newline character-delimited lines, useful for sampling from (for example) FASTQ files, where records are split across four lines.

By adding the --preserve-order option, the output sample preserves the input order. For example, when sampling from an input BED file that has been sorted by BEDOPS sort-bed — which applies a lexicographical sort on chromosome names and a numerical sort on start and stop coordinates — the sample will also have the same ordering applied, with a relatively small O(k log k) penalty for a sample of size k.

By omitting the sample size parameter, the sample tool will shuffle the entire file. This tool can be used to shuffle files that shuf has trouble with; however, in this use case it currently operates slower than shuf (where shuf can still be used). We recommend use of shuf for shuffling an entire file, or specifying the sample size (up to the line count, if known ahead of time), when possible.

One downside at this time is that sample does not process a standard input stream; the input must be a regular file.

modified 4.4 years ago • written 4.5 years ago by Alex Reynolds26k

Hi Alex

I would like to use reservoir sample to randomly sample 10000 SNPs from a VCF file and produce a new VCF file with those SNPs. Can reservoir sample do this? Also, could you please provide some installation instructions?

Thanks!

Monique

written 4.4 years ago by monique.sakalidis0

I think the following should work:

$ git clone https://github.com/alexpreynolds/sample.git
...
$ cd sample
$ make
...
$ ./sample -k 10000 /path/to/snps.vcf > /path/to/snps_subset.vcf

I've added some features recently, but the current code should work.

Note that I changed the default behavior (as described above) from sample-in-order to sample-and-shuffle. If you need to preserve order, add the --preserve-order operand.

modified 4.4 years ago • written 4.4 years ago by Alex Reynolds26k

This is perfect, thanks so much for the quick response!

Perhaps this is beyond the capabilities of reservoir-shuffle, but is it possible to subsample the sites for all individuals in the VCF file?

e.g. let's say I have a VCF file that contains 100 individuals and 1 million SNP sites. I would like a new file with all 100 individuals and 10,000 SNP sites that have been randomly sampled from the larger file.

modified 4.4 years ago • written 4.4 years ago by monique.sakalidis0

This is just a generic sample tool. You might use VCFTools to split your parent file up by individual, then draw a random sample from each subset.

written 4.4 years ago by Alex Reynolds26k
Powered by Biostar version 2.3.0