Process your bam file into a sorted BED file with your 5' ends, using BEDOPS bam2bed:
$ bam2bed < foo.bam \
| awk '{ \
if ($6 == "+") { \
print $1"\t"$2"\t"($2+1)"\t"$4"\t"$5"\t"$6; \
} \
else { \
print $1"\t"($3+1)"\t"($3+2)"\t"$4"\t"$5"\t"$6; \
} \
}' - \
> sortedFivePrimeEnds.bed
(If you're using BEDOPS v2.1, then it is not necessary to post-process with BEDOPS sort-bed. If you're using an older installation, pipe to sort-bed before piping to awk.)
Now use BEDOPS bedmap to count 5' ends within regions. We assume unsortedRegions.bed is an unsorted, four-column BED file containing the coordinates of the region and an ID describing the region. In case there are more columns, I use cut to select the first four columns:
$ sort-bed unsortedRegions.bed | cut -f1-4 - > sortedRegions.bed
$ bedmap --echo --count --delim '\t' sortedRegions.bed sortedFivePrimeEnds.bed \
> countsOfFivePrimeEndsOverRegions.bed
If your regions do not have ID values, use awk to add a placeholder ID value (e.g., id-123, etc.) in the sorted output from sort-bed:
$ sort-bed unsortedRegions.bed \
| awk ' \
BEGIN { idx = 0; } \
{ \
print $1"\t"$2"\t"$3"\tidx-"idx; \
idx++; \
} \
' - \
> sortedRegions.bed
In any case, as sortedRegions.bed is a four-column BED file, and because we use the --delim '\t' operator with bedmap, the result file called countsOfFivePrimeEndsOverRegions.bed will contain the number of ends in the fifth (score) column.
Now use awk to measure the average count over each region:
$ awk ' \
{ \
regionStop = $3; \
regionStart = $2; \
regionWidth = regionStop - regionStart; \
endCount = $5; \
averageEnds = endCount / regionWidth; \
print $1"\t"$2"\t"$3"\t"$4"\t"averageEnds; \
}' countsOfFivePrimeEndsOverRegions.bed \
> averageEndsOverRegions.bed
The average ends result is in the fifth (score) column of averageEndsOverRegions.bed.
Once you have an understanding of the process, it is possible to glue together most of these commands into a pipeline with standard UNIX pipes, which will improve speed greatly by removing the expense of creating intermediate files.
If you need to do strand-separation, change the initial bam2bed step appropriately:
$ bam2bed < foo.bam \
| awk '{ \
if ($6 == "+") { \
print $1"\t"$2"\t"($2+1)"\t"$4"\t"$5"\t"$6; \
} \
}' - \
> sortedForwardStrandFivePrimeEnds.bed
$ bam2bed < foo.bam \
| awk '{ \
if ($6 == "-") { \
print $1"\t"($3+1)"\t"($3+2)"\t"$4"\t"$5"\t"$6; \
} \
}' - \
> sortedReverseStrandFivePrimeEnds.bed
Then do the follow-up bedmap and awk processing steps on each file separately, as described.