The bedops and bedmap utilities in the BEDOPS suite do not generate bins, but you could use awk to do this, which might be faster than using R.
To demonstrate how to solve your problem, let's start with a couple of example genes, foo and bar:
$ more genes.bed
chr1 1000 4000 foo
chr1 6000 9000 bar
Here's how we could partition them into, say, five bins for each gene:
$ awk '
BEGIN {
  binNum = 5;                              # bins per gene
}
{
  chr = $1;
  start = $2;
  stop = $3;
  binIdx = 1;
  binSize = (stop - start) / binNum;       # may be fractional for some genes
  # walk across the gene, printing one bin interval per step
  for (i = start; i < stop; i += binSize) {
    print chr"\t"i"\t"(i + binSize)"\t"binIdx;
    binIdx++;
  }
}' genes.bed > bins.bed
Here's what our example bins.bed looks like:
$ more bins.bed
chr1 1000 1600 1
chr1 1600 2200 2
chr1 2200 2800 3
chr1 2800 3400 4
chr1 3400 4000 5
chr1 6000 6600 1
chr1 6600 7200 2
chr1 7200 7800 3
chr1 7800 8400 4
chr1 8400 9000 5
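As a quick sanity check: each example gene spans 3000 bases, so each of its five bins should be 3000 / 5 = 600 bases wide, which is what we see above.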
Note that we're using the ID field to store the bin index; we're not including the name of the gene there.
If I understand your question correctly (and the rest of this answer depends on this understanding), the actual name of the gene doesn't matter for the final answer. We just want an average signal for each bin across all genes in the set, where every gene is divided into the same number of partitions regardless of its size.
(Also, for the purposes of demonstration, I'm "cheating" with my choice of gene BED regions and bins, so that everything partitions cleanly. I'll leave it to you to handle edge cases where things aren't so neatly divisible; one way you might do that is sketched below.)
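For instance, here's a minimal sketch of one possible approach (my own variation, not something the steps below depend on): compute integer bin boundaries with int(), so that rounding differences are pushed into the interior cut points, the last bin absorbs any remainder, and every gene is still covered end to end:

$ awk '
BEGIN {
  binNum = 5;                              # bins per gene
}
{
  chr = $1;
  start = $2;
  stop = $3;
  span = stop - start;
  for (binIdx = 1; binIdx <= binNum; binIdx++) {
    # integer boundaries; binStop for the last bin is always exactly stop
    binStart = start + int(span * (binIdx - 1) / binNum);
    binStop  = start + int(span * binIdx / binNum);
    print chr"\t"binStart"\t"binStop"\t"binIdx;
  }
}' genes.bed > bins.bed

On the cleanly divisible example genes above, this produces exactly the same bins.bed as the simpler loop.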
Next, let's set up an example signal file (coverage data). It doesn't matter what's in the ID field — only that something is there, so that the score or signal value is in the fifth column, thus ensuring that our tools are working with a properly UCSC-formatted BED file:
$ more signal.bed
chr1 1000 1500 id-1 1
chr1 1500 2000 id-2 3
chr1 2000 2500 id-3 4
chr1 2500 3000 id-4 4
chr1 3000 3500 id-5 8
chr1 3500 4000 id-6 20
chr1 4000 4500 id-7 15
chr1 4500 5000 id-8 12
chr1 5000 5500 id-9 30
chr1 5500 6000 id-10 35
chr1 6000 6500 id-11 20
chr1 6500 7000 id-12 10
chr1 7000 7500 id-13 15
chr1 7500 8000 id-14 12
chr1 8000 8500 id-15 23
chr1 8500 9000 id-16 26
Let's first do a naïve bedmap with the --mean operator to see what we get. We'll include the --echo operator, so we can see what mean signal is associated with what reference element (bin):
$ bedmap --echo --mean bins.bed signal.bed
chr1 1000 1600 1|2.000000
chr1 1600 2200 2|3.500000
chr1 2200 2800 3|4.000000
chr1 2800 3400 4|6.000000
chr1 3400 4000 5|14.000000
chr1 6000 6600 1|15.000000
chr1 6600 7200 2|12.500000
chr1 7200 7800 3|13.500000
chr1 7800 8400 4|17.500000
chr1 8400 9000 5|23.000000
Firstly, the precision of the mean result needs adjusting, using the --prec operator. Secondly, we'll remove the --echo operator, because we don't need it for the next step:
$ bedmap --mean --prec 0 bins.bed signal.bed
2
4
4
6
14
15
12
14
18
23
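(If you're wondering why the 3.5 and 12.5 means above became 4 and 12 rather than both rounding up, the zero-precision output looks like round-half-to-even, printf-style rounding; the 13.5 and 17.5 means becoming 14 and 18 are consistent with that.)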
As a first pass, to demonstrate how we'll get a final answer, we'll pass our bedmap-ed stream of mean values to another awk statement, to track cumulative sums of means over each individual bin, while counting the number of genes we traverse:
$ bedmap --mean --prec 0 bins.bed signal.bed \
  | awk '
BEGIN {
  binNum = 5;                              # must match the bin count used to build bins.bed
  binCount = 0;
  geneCount = 1;
}
{
  binIdx = (binCount % binNum) + 1;        # recover the bin index from the line number
  binCumulSums[binIdx] += $0;
  print binIdx"\t"$0"\t"binCumulSums[binIdx]"\t"geneCount;
  binCount++;
  if (binIdx == binNum) {                  # passing the last bin means we finished a gene
    geneCount++;
  }
}' - > intermediateResults.tsv
Let's take a look at these results:
$ more intermediateResults.tsv
1 2 2 1
2 4 4 1
3 4 4 1
4 6 6 1
5 14 14 1
1 15 17 2
2 12 16 2
3 14 18 2
4 18 24 2
5 23 37 2
Let's break this table of intermediate values down by column:
- The first column is the bin ID value (1 through 5).
- The second column is the mean signal across that individual bin.
- The third column is the cumulative sum of signal across bins of that bin ID (for example, for bin 4, the cumulative sum is 6 + 18, or 24).
- The fourth column is the gene index associated with the listed bin, irrespective of the gene name. Here, we just have two genes, foo and bar (see genes.bed), so we have two gene indices, 1 and 2. (In your case, the total number of genes is 30.2K, so you'd see indices 1..30200.)
We now have enough information to calculate a final answer. We'll remove the print statement in the awk command above that shows these intermediate values, and replace it with an END block that calculates and prints a final result, the average signal over each individual bin:
$ bedmap --mean --prec 0 bins.bed signal.bed \
  | awk '
BEGIN {
  binNum = 5;
  binCount = 0;
  geneCount = 1;
}
{
  binIdx = (binCount % binNum) + 1;
  binCumulSums[binIdx] += $0;
  binCount++;
  if (binIdx == binNum) {
    geneCount++;
  }
}
END {
  # geneCount finishes one past the number of genes, hence the -1
  for (idx = 1; idx <= binNum; idx++) {
    print idx"\t"binCumulSums[idx]/(geneCount - 1);
  }
}' - > answer.tsv
Let's look at the answer we'd expect from our sample data:
$ more answer.tsv
1 8.5
2 8
3 9
4 12
5 18.5
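To verify this against intermediateResults.tsv: the bin 1 value is the average of the two per-gene means, (2 + 15) / 2 = 8.5, and likewise bin 5 is (14 + 23) / 2 = 18.5.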
How well will this scale to your dataset? I honestly don't know, but since we're streaming through bedmap results one line at a time, I think the memory overhead is going to be perfectly reasonable on a typical, modern workstation.
If you want to optimize this further, you could even avoid the file I/O expense of creating the intermediate bins.bed file, and just pipe the initial bin-creating awk statement into bedmap (as well as taking out the bin index variable, which is unused by bedmap --mean), e.g.:
$ awk '
BEGIN {
  binNum = 5;
}
{
  chr = $1;
  start = $2;
  stop = $3;
  binSize = (stop - start) / binNum;
  # no bin index printed here, since bedmap --mean does not use the ID field
  for (i = start; i < stop; i += binSize) {
    print chr"\t"i"\t"(i + binSize);
  }
}' genes.bed \
  | bedmap --mean --prec 0 - signal.bed \
  | awk '
BEGIN {
  binNum = 5;
  binCount = 0;
  geneCount = 1;
}
{
  binIdx = (binCount % binNum) + 1;
  binCumulSums[binIdx] += $0;
  binCount++;
  if (binIdx == binNum) {
    geneCount++;
  }
}
END {
  for (idx = 1; idx <= binNum; idx++) {
    print idx"\t"binCumulSums[idx]/(geneCount - 1);
  }
}' - > probablyFasterAnswer.tsv
Here, we replaced bins.bed in the bedmap statement with the hyphen character (-), which denotes standard input. Eliminating disk-based I/O should make this analysis run even faster. Handling standard input and output in this way is one of the major strengths of BEDOPS tools.
Finally, this is one of those "embarrassingly parallel" problems, if you think about splitting bins by ID into separate analysis tasks (which I'll leave to you to implement, though a rough sketch follows).
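As a rough, untested sketch of that idea (the file names means.txt and answer.$k.txt are just ones I've picked for illustration): write the bedmap means to a file once, then average every binNum-th line in a separate background job per bin index:

$ bedmap --mean --prec 0 bins.bed signal.bed > means.txt
$ for k in 1 2 3 4 5; do
    awk -v k=$k -v binNum=5 '
      # lines k, k+binNum, k+2*binNum, ... belong to bin index k
      NR % binNum == k % binNum { sum += $0; n++ }
      END { print k"\t"sum/n }' means.txt > answer.$k.txt &
  done; wait
$ sort -k1,1n answer.?.txt > parallelAnswer.tsv

Note that the bedmap step still runs serially; this only spreads the per-bin averaging across cores, which matters more as the bin count grows.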
Sorry, I have a couple of questions. In the genes_bins.bed example file, the gene Xkr4 is divided into six bins. Is this correct? Second question: if you want to split each gene into 100 bins, how do you handle the fact that different genes may have different lengths?
Giovanni, that's just the head of the file; I will mention that. Secondly, that's the whole point of binning: the genes have unequal lengths. If you want to plot read density/distribution/composite profiles [ChIP-seq], then the genes should be binned, and the average binding per bin is calculated at the end and summed per bin across all genes, to get an average binding over the whole length of the gene. The effect is that some genes contribute more density to a bin and some less. Can you suggest some other strategy?