Question: Compute average score across multiple bed files
1
gravatar for Andy Lee
2.0 years ago by
Andy Lee10
Andy Lee10 wrote:

I would like to compute the average score across multiple bed files (say from ChIP-seq) for a specific genomic region.

Let's say I have the following three bed files and I wish to compute the average score for chr1:10-20

a.bed

chr start end score
chr1 10 20 3

b.bed

chr start end score
chr1 12 14 3

c.bed

chr start end score
chr1 16 18 3

The desired output that I would like is the following:

chr start end mean.score
chr1 10 11 1
chr1 12 14 2
chr1 15 15 1
chr1 16 18 2
chr1 19 20 1

What is the computationally most efficient way of doing this?

bed python chip-seq R • 1.5k views
ADD COMMENTlink modified 2.0 years ago by ATpoint36k • written 2.0 years ago by Andy Lee10
1

It is not very clear what do you average on (the number of bases, peaks, files)? It seems to me as count and not mean, can you elaborate more how does chr1:10-11 gets score of 1 as average and how chr1:12-14 gets 2?

ADD REPLYlink written 2.0 years ago by husensofteng260

Hello Andy Lee!

It appears that your post has been cross-posted to another site: https://support.bioconductor.org/p/111405/

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink written 2.0 years ago by zx87549.4k
3
gravatar for ATpoint
2.0 years ago by
ATpoint36k
Germany
ATpoint36k wrote:

First, use bedtools unionbedg. This creates a bedgraph with one score column per sample. Then use awk to get the mean of it. Output is the unionbedgraph with the last column being the mean.The advantage of this code piece over the one from Carlo is that you do not have to specify the number of columns by hand. Credit to Aaron Quinlan. Also, note that the output example you gave does not seem to be correct. Consider to read about the BED format (keyword 0-based coordinate system).

bedtools unionbedg -i a.bed b.bed c.bed | awk '{sum=0; for (col=4; col<=NF; col++) sum += $col; print $0"\t"sum/(NF-4+1); }'
ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by ATpoint36k

Much better than my tinkering !

ADD REPLYlink written 2.0 years ago by Carlo Yague5.0k

My own tweak on this would be to print the average in the score column position as to make the product file consistent with the input BED file column positions.

Thus:

bedtools unionbedg -i <BED> | awk '{sum=0; for (col=4; col<=NF; col++) sum += $col; print $1"\t"$2"\t"$3"\t"sum/(NF-4+1); }' > merged.bed
ADD REPLYlink modified 3 months ago • written 3 months ago by alexandercmonovich0
1
gravatar for Carlo Yague
2.0 years ago by
Carlo Yague5.0k
Canada
Carlo Yague5.0k wrote:

Use bedtools merge with the -o and -c option.

cat a.bed b.bed c.bed | sort -k1,1 -k2,2n > merged.bed
bedtools merge -c 4 -o mean -i merged.bed

PS: this doesn't seem to be working as the intervals are merged (surprise !) together, regardless of the score. To make it work, I had to create a dummy mapping file then use bedtools map instead. In addition, the 'emtpy' intervals are not taken into account fot the mean calculation, so instead I do the sum and divide by 3.

cat dummy_map.bed
chr1    10  11
chr1    11  12
chr1    12  13
chr1    13  14
chr1    14  15
chr1    16  17
chr1    18  19
chr1    19  20

bedtools map -a dummy_map.bed -b merged.bed -c 4 -o sum | awk 'BEGIN {OFS="\t"};{print $1,$2,$3,$4/3}' 
chr1    10  11  1
chr1    11  12  1
chr1    12  13  2
chr1    13  14  2
chr1    14  15  1
chr1    16  17  2
chr1    18  19  1
chr1    19  20  1
ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by Carlo Yague5.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1682 users visited in the last hour