Compute average score across multiple bed files
2
1
Entering edit mode
4.7 years ago
Andy Lee ▴ 10

I would like to compute the average score across multiple bed files (say from ChIP-seq) for a specific genomic region.

Let's say I have the following three bed files and I wish to compute the average score for chr1:10-20

a.bed

chr start end score
chr1 10 20 3


b.bed

chr start end score
chr1 12 14 3


c.bed

chr start end score
chr1 16 18 3


The desired output that I would like is the following:

chr start end mean.score
chr1 10 11 1
chr1 12 14 2
chr1 15 15 1
chr1 16 18 2
chr1 19 20 1


What is the computationally most efficient way of doing this?

ChIP-Seq R python bed • 4.1k views
1
Entering edit mode

It is not very clear what do you average on (the number of bases, peaks, files)? It seems to me as count and not mean, can you elaborate more how does chr1:10-11 gets score of 1 as average and how chr1:12-14 gets 2?

0
Entering edit mode

Hello Andy Lee!

It appears that your post has been cross-posted to another site: https://support.bioconductor.org/p/111405/

This is typically not recommended as it runs the risk of annoying people in both communities.

6
Entering edit mode
4.7 years ago
ATpoint 70k

First, use bedtools unionbedg. This creates a bedgraph with one score column per sample. Then use awk to get the mean of it. Output is the unionbedgraph with the last column being the mean.The advantage of this code piece over the one from Carlo is that you do not have to specify the number of columns by hand. Credit to Aaron Quinlan. Also, note that the output example you gave does not seem to be correct. Consider to read about the BED format (keyword 0-based coordinate system).

bedtools unionbedg -i a.bed b.bed c.bed | awk '{sum=0; for (col=4; col<=NF; col++) sum += $col; print$0"\t"sum/(NF-4+1); }'


Edit 11/20: If you want a bedGraph as output then use:

awk 'OFS="\t" {sum=0; for (col=4; col<=NF; col++) sum += $col; print$1, $2,$3, sum/(NF-4+1); }'

I personally always use mawk rather than awk as it is much faster.

0
Entering edit mode

Much better than my tinkering !

0
Entering edit mode

My own tweak on this would be to print the average in the score column position as to make the product file consistent with the input BED file column positions.

Thus:

bedtools unionbedg -i <BED> | awk '{sum=0; for (col=4; col<=NF; col++) sum += $col; print$1"\t"$2"\t"$3"\t"sum/(NF-4+1); }' > merged.bed

1
Entering edit mode
4.7 years ago

Use bedtools merge with the -o and -c option.

cat a.bed b.bed c.bed | sort -k1,1 -k2,2n > merged.bed
bedtools merge -c 4 -o mean -i merged.bed


PS: this doesn't seem to be working as the intervals are merged (surprise !) together, regardless of the score. To make it work, I had to create a dummy mapping file then use bedtools map instead. In addition, the 'emtpy' intervals are not taken into account fot the mean calculation, so instead I do the sum and divide by 3.

cat dummy_map.bed
chr1    10  11
chr1    11  12
chr1    12  13
chr1    13  14
chr1    14  15
chr1    16  17
chr1    18  19
chr1    19  20

bedtools map -a dummy_map.bed -b merged.bed -c 4 -o sum | awk 'BEGIN {OFS="\t"};{print $1,$2,$3,$4/3}'
chr1    10  11  1
chr1    11  12  1
chr1    12  13  2
chr1    13  14  2
chr1    14  15  1
chr1    16  17  2
chr1    18  19  1
chr1    19  20  1