Binning Of Chromosome Data
8.0 years ago

Hi, I would like to bin my chromosome data into 1 Mbp bins. The data looks like this:

Chromosome     Start     End
chr1    15892    16421
chr1    17801    17812
chr2    24136    24223

I would like the output to look something like this:

Chr Start End Count
chr1 1 1000000  4
chr1  1000000 2000000 10

I would like to do this in awk. I would appreciate your help. Thanks.

awk • 12k views

Look into the modulo (%) operator. A corner case you'll have to handle is intervals that overlap two of your bins.
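As a minimal sketch of that approach (assuming a tab-separated file named data.txt with the header line shown in the question, and counting each interval in the bin that contains its start coordinate):

```shell
# Count intervals per 1 Mb bin, keyed on the bin containing the start
# coordinate; intervals straddling a boundary land in their start's bin.
awk -v OFS='\t' 'NR > 1 {
    bin = int($2 / 1000000)
    count[$1 OFS bin]++
}
END {
    for (k in count)
        print k, count[k]
}' data.txt | sort -k1,1 -k2,2n
```

Multiplying the bin index back out (bin*1000000 and (bin+1)*1000000) recovers the bin coordinates.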

8.0 years ago

Another option is to use the bedtools makewindows subcommand to make your 1 Mb bins. Note that chrom.sizes needs to be in the bedtools "genome file" format:

bedtools makewindows -g chrom.sizes -w 1000000 > genome.1Mb.bed

Then intersect your data with these windows. Assuming your data file is large, if you presort it (sort -k1,1 -k2,2n), the following command will give you what you want. The only caveat is that intervals spanning two adjacent 1 Mb windows will be counted in each window.

bedtools intersect -a genome.1Mb.bed -b data.sorted.bed -c -sorted
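For reference, the "genome file" is just a two-column, tab-separated list of chromosome names and lengths; for hg19 the first lines of chrom.sizes would look like:

```
chr1	249250621
chr2	243199373
chr3	198022430
```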

What format does the input data file need to be in to construct the bin mapping?


8.0 years ago

BEDOPS bedmap is an excellent tool to solve this problem. You can use the bedmap --count operation to count the number of elements that overlap your bins.

First, you need to prepare your data file for use with BEDOPS tools. Second, you need to prepare a set of bins with awk. Then you can use bedmap. I'll show you how to do each of these steps.

Let's set up your data file — let's call it myRawData.txt. You need to strip the first line (those column headers are unnecessary) and then sort that result with BEDOPS sort-bed:

$ tail -n +2 myRawData.txt \
    | sort-bed - \
    > mySortedData.bed

You need to sort the data, but this only needs to be done once. This gets your data ready for use with BEDOPS tools, which take advantage of the ordering in sorted data to operate fast and with a low memory profile, compared with alternative tools.

Now let's set up your bins. Let's say you are working with the human hg19 assembly. You'll need the bounds of each chromosome in hg19. To do this, retrieve the hg19 version of the chromInfo table from the UCSC Genome Browser:

  1. Visit the UCSC Table Browser. With the All Tables group selected, for example, select the hg19 database and the chromInfo table. Output all fields to a text file. (This step can also be performed with Kent-tools' hgsql commands, if this needs automating.)
  2. Edit this text file (e.g. run awk on it to put in a start coordinate of 0) and run this through sort-bed to turn it into a sorted BED file.
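For step 2, a minimal sketch, assuming the chromInfo dump is named chromInfo.txt, has a #-prefixed header line, and lists the chromosome name in column 1 and its length in column 2 (the trailing fileName column can be ignored):

```shell
# Skip the header, emit chrom/0/size as three-column BED, then sort.
# (sort-bed is the BEDOPS sorter; plain sort with the same key order
# stands in for it here so the sketch runs without BEDOPS installed.)
awk -v OFS='\t' '!/^#/ { print $1, 0, $2 }' chromInfo.txt \
    | sort -k1,1 -k2,2n \
    > chrList.bed
```

In a real pipeline you would pipe through sort-bed - as described above; the plain sort is only to keep the sketch self-contained.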

Here's a ready-to-use example for hg19 that you can download, which I have made for demonstration purposes. Here's what the first few lines look like:

chr1    0    249250621
chr10    0    135534747
chr11    0    135006516
chr11_gl000202_random    0    40103
chr12    0    133851895

Now that we have the bounds for our bins, let's make the bins! We'll save that chromosome list to a file called chrList.bed and use its region data to define bins:

$ awk '
    BEGIN {
        binSize = 1000000;
        binIdx = 0;
    }
    {
        chr = $1;
        start = $2;
        stop = $3;
        for (binStart = start; binStart < stop; binStart += binSize) {
            binStop = binStart + binSize;
            if (binStop > stop) { binStop = stop; }  # clip the last bin at the chromosome end
            print chr"\t"binStart"\t"binStop"\tbin-"binIdx;
            binIdx++;
        }
    }' chrList.bed \
    > myBins.bed

Now that we have our two sorted inputs: mySortedData.bed and myBins.bed, we're ready to use bedmap --count to answer your question:

$ bedmap --echo --count --fraction-map 0.51 myBins.bed mySortedData.bed > myAnswer.bed

This command maps elements in mySortedData.bed onto the binned regions in myBins.bed, counts the overlapping elements that meet the overlap criterion (described below), and prints the bins and their respective counts to the file myAnswer.bed.

This result file myAnswer.bed is a sorted BED file and can be consumed by other BEDOPS or command-line tools in a larger analysis pipeline.

Note that we used the --fraction-map option with a parameter of 0.51. This means that a "map" element in mySortedData.bed must overlap a region in myBins.bed by 51% or more of the mapped element's length. This deals with the edge case where an element in mySortedData.bed straddles two bins.

We set this overlap criterion because the default criterion in bedmap is one or more bases of overlap between an element and a binned region. The default would double-count elements that overlap two bins, which you probably do not want.

Using --fraction-map 0.51 forces an element to fall into one or the other bin, depending on which has greater relative overlap.

If you do want to double-count, then just remove the --fraction-map 0.51 setting.

Another option is to pre-process your raw input into single-base elements, taken from each interval's first base, e.g.:

$ tail -n +2 myRawData.txt \
    | sort-bed - \
    | awk '{ print $1"\t"$2"\t"($2+1); }' - \
    > mySortedFirstBasesData.bed

Because the bins are disjoint, you can then use bedmap with its default overlap setting, and each element will associate with one and only one bin, where there is overlap between the element's first base and that bin's genomic range.

However you choose to handle overlap cases, you will want to understand how your elements are being counted, as this can often lead to noise or errors that complicate results generated further downstream in your analysis.

8.0 years ago

I would suggest using bedtools, BEDOPS, Bioconductor's GenomicRanges, or any number of other libraries designed for genomic queries like this. Generally, it is as simple as making a BED file of your "bins" and calling a tool with your first file and the bins as input. You can probably do this in awk, but it would not be my first choice.

8.0 years ago
Ido Tamir 5.2k

No need for fancy tools. Who wants to learn all those complicated APIs that break on every release. Simply divide by your bin size, count and expand the bins (only works on sorted input):

awk '{ $2=int($2/1000000); print $1 "\t" $2 }' test.bed | uniq -c | awk '{ print $2 "\t" $3*1000000+1 "\t" $3*1000000+1000000 "\t" $1}'

Hope I got it correct... Maybe I should learn some APIs and do unit tests after all...
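As a quick check, running the one-liner on the (header-stripped, tab-separated) sample data from the question:

```shell
# sample intervals from the question, header removed
printf 'chr1\t15892\t16421\nchr1\t17801\t17812\nchr2\t24136\t24223\n' > test.bed

awk '{ $2=int($2/1000000); print $1 "\t" $2 }' test.bed \
    | uniq -c \
    | awk '{ print $2 "\t" $3*1000000+1 "\t" $3*1000000+1000000 "\t" $1 }'
# → chr1	1	1000000	2
#   chr2	1	1000000	1
```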


Nice. Only caveat being that intervals that span two windows are only counted in the lower window.

