It is a bit outdated; however, you could use the following BEDOPS usage script as a starting point. That example uses
awk to make windows, but that isn't necessary any longer.
Some recent chop and stagger features in the
bedops tool make it easy to generate windows for signal mapping.
hg19 chromosome extents:
We'll call this file
hg19.bed. It is sorted per
sort-bed and is ready to use.
Chop these extents by 20 bases, and stagger windows by 1 base (or whatever parameter you want to use to slide the window):
$ bedops --chop 20 --stagger 1 -x hg19.bed > hg19.20nt_bin_1nt_slide.bed
Consider adding the
-x (exclude) option, which leaves off any trailing windows less than 20 bases in length at the ends of each chromosome.
Feed the sliding bins or windows to
bedmap, to measure (methylation) signal over each window.
You can pick any of the number of statistical operations available in
bedmap. For example, you could use
--mean to get the mean methylation signal over the sliding window, or
--sum to get the density of methylation signal over the window, etc. Run
bedmap --help or read the documentation for more information about available score/signal operations.
As an example, the following gives the density of methylation signal over each sliding window:
$ bedmap --echo --sum --delim '\t' hg19.20nt_bin_1nt_slide.bed methylation.bed > answer.bed
methylation.bed should be at least a five-column sorted BED file, with the score or signal in the fifth column.
answer.bed will contain each sliding window and a final column containing the sum of signal values of any methylation regions that overlap the sliding window. You could read this file into any number of tools (R, UCSC Genome Browser, etc.) to plot smoothed signal.
Because 20nt is a very fine window size, this will generate very large intermediate and final files.
A couple suggestions for dealing with these issues are to merge steps in a UNIX pipeline and to output to a compressed form of BED called Starch.
The following suggestion uses a pipe character
| to pass the sliding windows to the
bedmap mapping step. The use of the hyphen
- is a placeholder for standard output from the previous
$ bedops --chop 20 --stagger 1 -x hg19.bed | bedmap --echo --sum --delim '\t' - methylation.bed > answer.bed
This approach does the same thing as steps 1 and 2, and runs faster than running steps 1 and 2 separately. It uses much less disk space, because it does not create the intermediate regular file
hg19.20nt_bin_1nt_slide.bed — a file that you probably don't really need to keep around, anyway, so why waste time making it?
A second suggestion is to compress the output with a tool in BEDOPS called
starch. This application removes redundancy in the coordinates and applies a second compression step. BED files get much smaller:
$ bedops --chop 20 --stagger 1 -x hg19.bed | bedmap --echo --sum --delim '\t' - methylation.bed | starch - > answer.starch
You can use
unstarch answer.starch to extract data. You can specify a chromosome of interest, as well:
unstarch chrY answer.starch.
unstarch tool passes uncompressed BED data to standard output, so you can pipe out to downstream processes. Also, other BEDOPS tools natively work with Starch archives as if they were BED files, so you can do set and statistical operations with them directly, without piping:
$ bedmap --echo --echo-map-id-uniq foo.starch bar.bed > baz.bed
Making a compressed archive will add to the overall processing time but reduce disk usage even further. You might start with the initial mapping to an uncompressed BED file, and then explore Starch when disk usage becomes a concern.
Other options include limiting your signal operations to regions of interest. You might leave out RepeatMasked regions from windows, for instance. You could use
bedops --difference to put "holes" in the
hg19.bed extents file, and then do your windowing and mapping from that product. Your calculations will go faster because you don't waste time with regions that may not be biologically interesting or relevant to your experiment.
Hope this helps!