Process your bam
file into a sorted BED file with your 5' ends, using BEDOPS bam2bed
:
$ bam2bed < foo.bam \
| awk '{ \
if ($6 == "+") { \
print $1"\t"$2"\t"($2+1)"\t"$4"\t"$5"\t"$6; \
} \
else { \
print $1"\t"($3+1)"\t"($3+2)"\t"$4"\t"$5"\t"$6; \
} \
}' - \
> sortedFivePrimeEnds.bed
(If you're using BEDOPS v2.1, then it is not necessary to post-process with BEDOPS sort-bed
. If you're using an older installation, pipe to sort-bed
before piping to awk
.)
Now use BEDOPS bedmap
to count 5' ends within regions. We assume unsortedRegions.bed
is an unsorted, four-column BED file containing the coordinates of the region and an ID describing the region. In case there are more columns, I use cut
to select the first four columns:
$ sort-bed unsortedRegions.bed | cut -f1-4 - > sortedRegions.bed
$ bedmap --echo --count --delim '\t' sortedRegions.bed sortedFivePrimeEnds.bed \
> countsOfFivePrimeEndsOverRegions.bed
If your regions do not have ID values, use awk
to add a placeholder ID value (e.g., id-123
, etc.) in the sorted output from sort-bed
:
$ sort-bed unsortedRegions.bed \
| awk ' \
BEGIN { idx = 0; } \
{ \
print $1"\t"$2"\t"$3"\tidx-"idx; \
idx++; \
} \
' - \
> sortedRegions.bed
In any case, as sortedRegions.bed
is a four-column BED file, and because we use the --delim '\t'
operator with bedmap
, the result file called countsOfFivePrimeEndsOverRegions.bed
will contain the number of ends in the fifth (score) column.
Now use awk
to measure the average count over each region:
$ awk ' \
{ \
regionStop = $3; \
regionStart = $2; \
regionWidth = regionStop - regionStart; \
endCount = $5; \
averageEnds = endCount / regionWidth; \
print $1"\t"$2"\t"$3"\t"$4"\t"averageEnds; \
}' countsOfFivePrimeEndsOverRegions.bed \
> averageEndsOverRegions.bed
The average ends result is in the fifth (score) column of averageEndsOverRegions.bed
.
Once you have an understanding of the process, it is possible to glue together most of these commands into a pipeline with standard UNIX pipes, which will improve speed greatly by removing the expense of creating intermediate files.
If you need to do strand-separation, change the initial bam2bed
step appropriately:
$ bam2bed < foo.bam \
| awk '{ \
if ($6 == "+") { \
print $1"\t"$2"\t"($2+1)"\t"$4"\t"$5"\t"$6; \
} \
}' - \
> sortedForwardStrandFivePrimeEnds.bed
$ bam2bed < foo.bam \
| awk '{ \
if ($6 == "-") { \
print $1"\t"($3+1)"\t"($3+2)"\t"$4"\t"$5"\t"$6; \
} \
}' - \
> sortedReverseStrandFivePrimeEnds.bed
Then do the follow-up bedmap
and awk
processing steps on each file separately, as described.