The way I have made these kinds of plots is:
- Generate positions of a set of junctions (or other features of interest: TF binding sites, etc.) and separating into strand-specific subsets, if necessary
- Calculate windows around feature positions with
bedops --range
or similar
- "Melt" the window intervals into per-base elements, using Perl or Python etc.
- Score the per-base elements against signal files (here, normalized reads) with
bedmap --echo-map-score
or similar, replacing NAs with zeroes, if appropriate, and applying any strand-specific signal reorientation or shift adjustments, if needed
- Merge per-base element scores into a matrix file, one row per windowed feature, using scripting language of choice
Once I have this matrix, I can plot in R:
- Import matrix into R variable
- Calculate column means with
colMeans()
(or calculate other columnar statistic: medians, etc. using apply()
)
- Create empty plot with
plot()
- Use
polygon()
to generate shaded region around bases of interest
- Render axes with
axis()
- Draw columnar means (or whatever statistic) with
lines()
Once the overall process is set up, these steps are pulled into a makefile
for pipelining and putting on a cluster.
If you have a specific question about any part of this, feel free to follow up and I can see how much detail I can provide.
Generating windows around BED elements
To generate flanking regions around elements, I use bedops --range
with some chosen size value for flanking regions. For instance, to generate a 150 bp window (300 bp in total) around a set of single-base junctions:
$ bedops --everything --range 150 junctions.bed > 150bp_padded_junctions.bed
This operation preserves all columns, including ID and strand data. I can then melt this padded file into per-base elements, score the per-base elements, and build a matrix for statistical analysis or heatmaps, etc.
Melting a BED file
To "melt" a BED6 (six-column BED) file into per-base elements, I use the following script:
To use it:
$ melt.pl < elements.bed > per-base.elements.bed
This "melts" the BED file into single-base bins. You could modify this script to generate any bin size you like, but you'd want to think about the edge case where an interval is not divided evenly by a chosen bin size, and how that might affect calculating statistics.
The ID field is changed. It becomes originalID!index
, where originalID
is the ID field of the original element, and index
is a unique integer value that identifies and groups the per-base elements.
I use !
as a delimiter that is unlikely to be found as a character in the names of genes or other elements I am melting. The idea here is that I can group per-base elements after scoring with bedmap, generate my matrix file, and recover the original ID, if needed.
You can also use bedops --chop
here:
$ bedops --chop 1 elements.bed > per-base.elements.bed
However, this will strip fourth and subsequent columns, removing ID and strand information. I find it useful - for the purposes of this specific pipeline, for making these types of plots - to create a custom ID that groups per-base elements made from the original element, which is easier to do with a scripted approach.
This is actually similar to what I had planned to do, I am just working on generating the exon positions with orientation. I'll post back here to let you know how it goes.