Question: Compute average score across multiple bed files
1
gravatar for Andy Lee
7 months ago by
Andy Lee10
Andy Lee10 wrote:

I would like to compute the average score across multiple bed files (say from ChIP-seq) for a specific genomic region.

Let's say I have the following three bed files and I wish to compute the average score for chr1:10-20

a.bed

chr start end score
chr1 10 20 3

b.bed

chr start end score
chr1 12 14 3

c.bed

chr start end score
chr1 16 18 3

The desired output that I would like is the following:

chr start end mean.score
chr1 10 11 1
chr1 12 14 2
chr1 15 15 1
chr1 16 18 2
chr1 19 20 1

What is the computationally most efficient way of doing this?

bed python chip-seq R • 404 views
ADD COMMENTlink modified 7 months ago by ATpoint14k • written 7 months ago by Andy Lee10
1

It is not very clear what do you average on (the number of bases, peaks, files)? It seems to me as count and not mean, can you elaborate more how does chr1:10-11 gets score of 1 as average and how chr1:12-14 gets 2?

ADD REPLYlink written 7 months ago by husensofteng50

Hello Andy Lee!

It appears that your post has been cross-posted to another site: https://support.bioconductor.org/p/111405/

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink written 7 months ago by zx87546.8k
3
gravatar for ATpoint
7 months ago by
ATpoint14k
Germany
ATpoint14k wrote:

First, use bedtools unionbedg. This creates a bedgraph with one score column per sample. Then use awk to get the mean of it. Output is the unionbedgraph with the last column being the mean.The advantage of this code piece over the one from Carlo is that you do not have to specify the number of columns by hand. Credit to Aaron Quinlan. Also, note that the output example you gave does not seem to be correct. Consider to read about the BED format (keyword 0-based coordinate system).

bedtools unionbedg -i a.bed b.bed c.bed | awk '{sum=0; for (col=4; col<=NF; col++) sum += $col; print $0"\t"sum/(NF-4+1); }'
ADD COMMENTlink modified 7 months ago • written 7 months ago by ATpoint14k

Much better than my tinkering !

ADD REPLYlink written 7 months ago by Carlo Yague4.4k
1
gravatar for Carlo Yague
7 months ago by
Carlo Yague4.4k
Belgium
Carlo Yague4.4k wrote:

Use bedtools merge with the -o and -c option.

cat a.bed b.bed c.bed | sort -k1,1 -k2,2n > merged.bed
bedtools merge -c 4 -o mean -i merged.bed

PS: this doesn't seem to be working as the intervals are merged (surprise !) together, regardless of the score. To make it work, I had to create a dummy mapping file then use bedtools map instead. In addition, the 'emtpy' intervals are not taken into account fot the mean calculation, so instead I do the sum and divide by 3.

cat dummy_map.bed
chr1    10  11
chr1    11  12
chr1    12  13
chr1    13  14
chr1    14  15
chr1    16  17
chr1    18  19
chr1    19  20

bedtools map -a dummy_map.bed -b merged.bed -c 4 -o sum | awk 'BEGIN {OFS="\t"};{print $1,$2,$3,$4/3}' 
chr1    10  11  1
chr1    11  12  1
chr1    12  13  2
chr1    13  14  2
chr1    14  15  1
chr1    16  17  2
chr1    18  19  1
chr1    19  20  1
ADD COMMENTlink modified 7 months ago • written 7 months ago by Carlo Yague4.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2502 users visited in the last hour