BEDOPS is a suite of tools to address common questions raised in genomic studies — mostly with regard to overlap and proximity relationships between data sets. It aims to be fast and flexible, facilitating the efficient and accurate analysis and management of large-scale genomic data.
The second major release of BEDOPS includes several new features which focus on improving how we handle arbitrarily large datasets, namely through compression and parallelization.
We have recently released BEDOPS v2.2. This includes fixes for an `unstarch` row cutoff bug, restoration of `starchcat`'s `gzip`-backing support, as well as a reversion to the C-based `wig2bed` conversion utility to restore performance lost with the Python script. For those who make use of the source code download, we have also added a test suite for the Starch toolkit; see the `starch/test/README` documentation and `makefile` for more information.
We strongly recommend updating to this latest version. 64-bit Linux and 32-/64-bit Mac OS X installer packages are available from the Google site. Source code is also available from the Google site's SVN service; you will want to use `gcc` 4.7.2 or 4.8.0 to compile this suite.
All changes are summarized on the Google Code site.
Released in early May 2013, BEDOPS v2.1.1 features include:
- Significant performance enhancements to `bedmap`.
- Bug fixes for `bedops --partition`.
- Improved error handling in Python-based `*2bed` conversion scripts (including `wig2bed`).
- Other minor fixes.
Here is a summary of v2.1.0 features released in April 2013:
bedops
New `--partition` operator
This operator will efficiently split overlapping inputs and report disjoint segments that partition the shared genomic space.
To demonstrate, say you have a few input BED files (sorted with BEDOPS `sort-bed`) or equivalent Starch archives. Together they have coordinate segments on `chrN` that look like:

[Figure: five input segments on `chrN`, some of which overlap]
The output from `--partition` on these inputs would be:

[Figure: eight disjoint segments that together cover the input segments]
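The idea can be tried on a small, made-up example without BEDOPS installed. The following is only a sketch of the concept, not how `bedops` implements it: the helper name `partition_sketch` is hypothetical, it handles a single chromosome, and it assumes tab-delimited BED input. It sweeps over all start and stop positions and reports every stretch between neighbouring boundaries that is covered by at least one input segment.

```shell
# Hypothetical helper, not part of BEDOPS: emulate the idea behind
# `bedops --partition` for BED intervals that all lie on ONE chromosome.
# Usage: partition_sketch chrN < intervals.bed
partition_sketch() {
  # Turn each interval into two events: +1 at its start, -1 at its stop.
  awk 'BEGIN { FS = OFS = "\t" } { print $2, 1; print $3, -1 }' |
    sort -n |
    awk -v chrom="$1" 'BEGIN { FS = OFS = "\t"; depth = 0 }
      {
        # Report the stretch since the previous boundary if any input covers it.
        if (depth > 0 && $1 > prev) print chrom, prev, $1
        depth += $2
        prev = $1
      }'
}

# Made-up input: two overlapping segments and one isolated segment.
printf 'chrN\t10\t50\nchrN\t30\t80\nchrN\t100\t120\n' | partition_sketch chrN
```

For this input the sketch reports the segments 10-30, 30-50, 50-80 and 100-120: the overlap of the first two inputs becomes its own segment, and the uncovered gap between 80 and 100 is not reported.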
One example of where this is useful is in finding intersections of elements within a single BED file, which was not possible with BEDOPS tools until now. Consider the following usage, where `input.bed` is a sorted BED file that we want to "self-intersect":

    $ bedops --partition input.bed \
        | bedmap --count --echo - input.bed \
        | awk -F"|" '($1 > 1) { print $2; }'
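The final `awk` stage of this pipeline can be examined on its own. In the sketch below, the helper name `keep_multi_overlap` and the sample lines are made up; the lines only mimic the `count|element` layout that the `awk` script in the pipeline expects. Segments whose count is greater than one are the ones shared by two or more elements of the input.

```shell
# Hypothetical helper wrapping the last stage of the pipeline: keep only
# segments that overlap more than one input element, and drop the count.
keep_multi_overlap() {
  awk -F"|" '($1 > 1) { print $2; }'
}

# Made-up lines in the count|element layout the awk script expects.
printf '1|chr1\t10\t30\n2|chr1\t30\t50\n1|chr1\t50\t80\n' | keep_multi_overlap
```

Only the middle segment (`chr1 30 50`) survives the filter, because it is the only one with a count above one.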
A "real-world" application of this feature is in comparing paired-end reads, where the goal is to facilitate a quick search for abnormal insertions (or, conversely, deletions) between two sequencing experiments.
(Thanks to Shane for the usage tip.)
starch
- Improved error checking for interleaved records
Conversion scripts
- All scripts now use BEDOPS `sort-bed` behind the scenes to output sorted BED output, ready for consumption by BEDOPS utilities like `bedextract`, `bedmap`, `bedops` and `closest-features`. In other words, it is no longer necessary to pipe converted output to `sort-bed` before piping to other BEDOPS utilities.
- New `psl2bed` conversion script, converting PSL-formatted UCSC BLAT output to BED.
- New `wig2bed` conversion script written in Python.
- New `*2starch` convenience scripts offered for all `*2bed` scripts, which convert data and output Starch v2 archives.
Improved Mac OS X support
New installer package makes installation of BEDOPS binaries and scripts much easier for OS X 10.6 - 10.8 hosts.
Installer resolves fatal library errors seen by some end users of older OS X BEDOPS releases.
This release also includes major BEDOPS v2 features, such as:
Support for BEDOPS Starch archives with main toolkit
- The `bedops`, `bedmap`, `bedextract` and `closest-features` tools now all accept Starch-formatted files as inputs, as well as UCSC BED files, as before. (In other words, it is no longer necessary to extract Starch data to intermediate files before applying set or statistical operations.)
Very efficient single-chromosome operations
- New `--chrom` operator applies set, statistical or ID operations to a specified chromosome with `bedmap`, `bedops` and `closest-features`, without needing to stream through the entire BED file. This is highly useful for parallelization tasks on very large BED data.
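As a rough sketch of how `--chrom` lends itself to parallelization, the function below starts one `bedmap` job per chromosome and waits for all of them to finish. The helper name `per_chrom_counts`, the output layout, and the choice of `--echo --count` as the operation are illustrative assumptions. It also assumes that the reference input is a Starch v2 archive, so that `unstarch --list-chromosomes` can supply the chromosome names.

```shell
# Hypothetical helper: run one bedmap job per chromosome in the background.
# Usage: per_chrom_counts reference.starch map.starch output_dir
per_chrom_counts() {
  ref=$1
  map=$2
  outdir=$3
  mkdir -p "$outdir"
  # Chromosome names come from the archive metadata of the reference file.
  for chrom in $(unstarch --list-chromosomes "$ref"); do
    # Each job only touches the records of its own chromosome.
    bedmap --chrom "$chrom" --echo --count "$ref" "$map" > "$outdir/$chrom.bed" &
  done
  # Block until every background job has finished.
  wait
}
```

On a cluster, the loop body would typically be submitted to a job scheduler rather than run as local background jobs.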
bedmap
- New `--echo-map-id-uniq` operator lists unique ID values from mapped elements.
- New `--max-element` and `--min-element` operators return the highest or lowest scoring overlapping map element.
sort-bed
- New `--max-mem` option limits sorting to a specified amount of memory, useful for sorting BED inputs that are larger than system memory.
starch, unstarch and starchcat
- BEDOPS Starch v2 archives contain useful, precomputed metadata that can improve the efficiency of scripts.
  For instance, calling `unstarch --elements` on a Starch v2 archive shows the total number of records in the entire file or for any individual chromosome, while `unstarch --bases` and `unstarch --bases-uniq` give the number of total and unique bases covered by elements in the whole archive or over elements of the specified chromosome. These latter two options are analogous to those already available in `bedmap`.
  As an example, using the `--elements` operator on a Starch v2 archive made from DNaseI-seq or RNAseq tag data would return the total number of reads over the entire BED file. Using `--elements chr3` would return the total number of tags in chromosome `chr3`.
  Values are precomputed and stored in the archive's metadata, allowing practically instantaneous retrieval. Going back to `--elements` again, this option is much, much faster than extracting data and piping it to `wc -l`.
- New checksum data help validate the integrity of the archive and its metadata.
- Other metadata enhancements to Starch-format archival and extraction, including: `--note`, `--list-chromosomes`, `--archive-timestamp`, `--archive-type` and `--archive-version`.
- Added 20-35% performance boost to creating Starch archives with the `starch` utility.
- New documentation with technical overview of the Starch format specification.
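To make the `--elements` versus `wc -l` comparison concrete, the following sketch reads the element count both ways and checks that they agree. The helper name `check_element_count` is hypothetical, and the archive passed to it is assumed to be a Starch v2 archive.

```shell
# Hypothetical helper: compare the precomputed element count stored in the
# archive metadata with a count obtained by extracting every record.
check_element_count() {
  archive=$1
  # Fast path: read the count from the archive metadata.
  fast=$(unstarch --elements "$archive")
  # Slow path: extract all records and count the lines.
  slow=$(unstarch "$archive" | wc -l | tr -d ' ')
  if [ "$fast" -eq "$slow" ]; then
    echo "ok: $fast elements"
  else
    echo "mismatch: metadata says $fast, extraction gives $slow" >&2
    return 1
  fi
}
```

In everyday scripts only the fast path is needed; the slow path is included here to show what it replaces.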
Conversion scripts
- New `gtf2bed` conversion script, converting GTF (v2.2) to BED.
Overall improvements in 64-bit type handling and error checking
- Consistency across the codebase helps ensure that all BEDOPS applications can scale to arbitrarily large genomes.
+1 for the `--chrom` and `--echo-map-id-uniq` operators
Hi Alex, the BEDOPS sort-bed command is light-speeds faster than GNU sort. The problem for me, and probably for anyone else working on DNA methylation, SNPs or other single-base features, is that it spits back an error when the start and stop coordinates are the same in the input BED file. If there was a way to make it accept BED files where the start and stop coordinates are the same position, it would make it a much more useful tool for people like me. Thanks for your work on BEDOPS.
Ethan
One easy way to fix BED files with zero-length elements (where start = stop index) is to subtract one base from the start position with `awk` or similar, e.g.:
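A minimal sketch of such a fix, assuming tab-delimited BED input. The function name `fix_zero_length` and the file names in the usage line are hypothetical. The special case for a start position of 0, where subtracting one would give a negative coordinate, is an assumption added here: the stop position is extended by one base instead.

```shell
# Hypothetical helper: pad zero-length BED elements (start == stop)
# so that every element satisfies start < stop.
fix_zero_length() {
  awk 'BEGIN { FS = OFS = "\t" }
       $2 == $3 {
         # Move the start back by one base; at position 0 extend the stop instead.
         if ($2 > 0) { $2 = $2 - 1 } else { $3 = $3 + 1 }
       }
       { print }'
}

# Made-up input: one zero-length element, one regular element, one at position 0.
printf 'chr1\t100\t100\tcpg1\nchr1\t200\t250\tgene1\nchr1\t0\t0\tcpg0\n' | fix_zero_length
```

The padded output can then be sorted in the usual way, for example with `fix_zero_length < snps.bed | sort-bed - > snps.sorted.bed`.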