I would like to introduce pairedBamToBed12, a tool that we created at RIKEN some time ago to represent our paired-end transcriptome data with one single line per pair.
It is based on bedtools; technically, it is a fork with the command pairedBamToBed12 added and all the other commands removed (which means that it could be merged back into bedtools if there were enough interest). Our source code is on GitHub.
bedtools already has a bamtobed tool, but it did not fit our needs, because bamtobed -split leaves the forward (Read 1) and reverse (Read 2) reads on separate BED12 lines, and the BEDPE format output by bamtobed -bedpe does not support spliced alignments.
As a brief illustration of what it does:
Read 1:   >>>>>>>>>>>>
Read 2:                     <<<<<<<<<<<<<-----<<<<<<<
The pair: >>>>>>>>>>>>------>>>>>>>>>>>>>----->>>>>>>
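For readers who prefer code to ASCII art, here is a minimal Python sketch of the merge illustrated above. It is not the actual pairedBamToBed12 implementation (which is C++ code inside the bedtools fork); the function name pair_to_bed12, the choice to fuse overlapping or adjacent blocks, and the values written to the score, thickStart, thickEnd and itemRgb columns are all assumptions made for the illustration.

```python
def pair_to_bed12(chrom, name, strand, read1_blocks, read2_blocks, score=0):
    """Merge the aligned blocks of two mates into one BED12 line.

    Blocks are (start, end) tuples, 0-based and half-open, all on `chrom`.
    Hypothetical helper for illustration only, not the tool's real code.
    """
    merged = []
    for start, end in sorted(read1_blocks + read2_blocks):
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent blocks are fused (an assumption).
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    chrom_start = merged[0][0]
    chrom_end = merged[-1][1]
    block_sizes = ",".join(str(e - s) for s, e in merged)
    block_starts = ",".join(str(s - chrom_start) for s, _ in merged)

    fields = [
        chrom, chrom_start, chrom_end, name, score, strand,
        chrom_start, chrom_end,  # thickStart, thickEnd (assumed values)
        "0,0,0",                 # itemRgb (assumed value)
        len(merged), block_sizes, block_starts,
    ]
    return "\t".join(str(f) for f in fields)


# Same geometry as the diagram: Read 1 has one 12-bp block,
# Read 2 has a 13-bp and a 7-bp block separated by a 5-bp intron.
line = pair_to_bed12("chr1", "pair1", "+",
                     read1_blocks=[(100, 112)],
                     read2_blocks=[(118, 131), (136, 143)])
print(line.replace("\t", " "))
# chr1 100 143 pair1 0 + 100 143 0,0,0 3 12,13,7 0,18,36
```

Note that the unsequenced gap between the two mates ends up encoded exactly like an intron, which is the price of fitting the whole pair on one line.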
Perhaps the best way to see in more detail what the program does is to look at our regression tests. We mainly use it to represent our paired-end CAGE data (CAGEscan) before uploading them to Zenbu, our home-made genome browser. Zenbu can display BED12 files either as conventional intervals, or as quantitative coverage plots of the whole area or of the 5′ or 3′ end only (the 5′ end being particularly relevant for CAGE).
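To make the "5′ end" view concrete, here is a small Python sketch that extracts the 1-bp 5′ position from a BED12 line, taking the strand into account. This is only my illustration of the idea, not Zenbu's code; the function name five_prime_position is made up for the example.

```python
def five_prime_position(bed12_line):
    """Return (chrom, start, end, strand) for the 1-bp 5' end of a BED12 feature.

    Coordinates are 0-based and half-open, as in BED.
    Hypothetical helper for illustration only.
    """
    fields = bed12_line.rstrip("\n").split("\t")
    chrom, strand = fields[0], fields[5]
    start, end = int(fields[1]), int(fields[2])
    if strand == "+":
        # On the plus strand the 5' end is the first base of the feature.
        return chrom, start, start + 1, strand
    if strand == "-":
        # On the minus strand the 5' end is the last base of the feature.
        return chrom, end - 1, end, strand
    raise ValueError("strand must be '+' or '-', got %r" % strand)


plus = "chr1\t100\t143\tpair1\t0\t+\t100\t143\t0,0,0\t3\t12,13,7\t0,18,36"
print(five_prime_position(plus))
# ('chr1', 100, 101, '+')
```

For CAGE, counting these 1-bp positions over all pairs gives the transcription start site signal, while the full BED12 line keeps the information about where the transcript goes downstream.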
The main limitation of our approach is that it is strongly tied to proper pairing. In particular, it cannot represent transcripts that span multiple chromosomes, as in the case of recombinations, viral insertions, trans-splicing, etc. That said, this is not a big problem for projects that do not require exploring de novo transcript patterns. As a compromise workaround, we are currently considering optional support for using only one read when the pair is not properly paired.
pairedBamToBed12 is Free software (GPL-2, like bedtools), and I would be excited if it had more users and developers, which is why I am writing this post :)
That said, if there is a superior solution, whether already implemented or not, I would be very interested in discussing it. In particular, I wonder whether, in the long term, supporting recombinations that are not represented in the reference genome used for alignment would require giving up the simplicity of one pair per line and switching to a different format such as GFF...
-- Charles Plessy, Tsurumi, Kanagawa, Japan (working at RIKEN, see population-transcriptomics.org).