Question: Compute average score across multiple bed files
1
2.5 years ago by
Andy Lee10
Andy Lee10 wrote:

I would like to compute the average score across multiple bed files (say from ChIP-seq) for a specific genomic region.

Let's say I have the following three bed files and I wish to compute the average score for chr1:10-20

a.bed

``````chr start end score
chr1 10 20 3
``````

b.bed

``````chr start end score
chr1 12 14 3
``````

c.bed

``````chr start end score
chr1 16 18 3
``````

The desired output that I would like is the following:

``````chr start end mean.score
chr1 10 11 1
chr1 12 14 2
chr1 15 15 1
chr1 16 18 2
chr1 19 20 1
``````

What is the computationally most efficient way of doing this?

bed python chip-seq R • 1.9k views
modified 2.5 years ago by ATpoint44k • written 2.5 years ago by Andy Lee10
1

It is not very clear what do you average on (the number of bases, peaks, files)? It seems to me as count and not mean, can you elaborate more how does chr1:10-11 gets score of 1 as average and how chr1:12-14 gets 2?

Hello Andy Lee!

It appears that your post has been cross-posted to another site: https://support.bioconductor.org/p/111405/

This is typically not recommended as it runs the risk of annoying people in both communities.

3
2.5 years ago by
ATpoint44k
ATpoint44k wrote:

First, use `bedtools unionbedg`. This creates a bedgraph with one score column per sample. Then use awk to get the mean of it. Output is the unionbedgraph with the last column being the mean.The advantage of this code piece over the one from Carlo is that you do not have to specify the number of columns by hand. Credit to Aaron Quinlan. Also, note that the output example you gave does not seem to be correct. Consider to read about the BED format (keyword 0-based coordinate system).

``````bedtools unionbedg -i a.bed b.bed c.bed | awk '{sum=0; for (col=4; col<=NF; col++) sum += \$col; print \$0"\t"sum/(NF-4+1); }'
``````

Edit 11/20: If you want a bedGraph as output then use:

`awk 'OFS="\t" {sum=0; for (col=4; col<=NF; col++) sum += \$col; print \$1, \$2, \$3, sum/(NF-4+1); }'`

I personally always use `mawk` rather than `awk` as it is much faster.

Much better than my tinkering !

My own tweak on this would be to print the average in the score column position as to make the product file consistent with the input BED file column positions.

Thus:

``````bedtools unionbedg -i <BED> | awk '{sum=0; for (col=4; col<=NF; col++) sum += \$col; print \$1"\t"\$2"\t"\$3"\t"sum/(NF-4+1); }' > merged.bed
``````
1
2.5 years ago by
Carlo Yague5.5k
Carlo Yague5.5k wrote:

Use bedtools merge with the `-o` and `-c` option.

``````cat a.bed b.bed c.bed | sort -k1,1 -k2,2n > merged.bed
bedtools merge -c 4 -o mean -i merged.bed
``````

PS: this doesn't seem to be working as the intervals are merged (surprise !) together, regardless of the score. To make it work, I had to create a dummy mapping file then use bedtools map instead. In addition, the 'emtpy' intervals are not taken into account fot the mean calculation, so instead I do the sum and divide by 3.

``````cat dummy_map.bed
chr1    10  11
chr1    11  12
chr1    12  13
chr1    13  14
chr1    14  15
chr1    16  17
chr1    18  19
chr1    19  20

bedtools map -a dummy_map.bed -b merged.bed -c 4 -o sum | awk 'BEGIN {OFS="\t"};{print \$1,\$2,\$3,\$4/3}'
chr1    10  11  1
chr1    11  12  1
chr1    12  13  2
chr1    13  14  2
chr1    14  15  1
chr1    16  17  2
chr1    18  19  1
chr1    19  20  1
``````