Question: How To Extract Scores From Bedgraph File Using Bed Tools
3
gravatar for biorepine
6.6 years ago by
biorepine1.4k
Spain
biorepine1.4k wrote:

file1

chr1 10 20 name 0 +

file2

chr1 12 14 2.5
chr1 14 15 0.5

How could i extract average scores of file1 using file2, like below? I am trying to extract phastcons (file2) average scores of file1.

chr1  10 20 name 0 + 1.5
bedtools • 8.1k views
ADD COMMENTlink modified 5 months ago by brianpenghe20 • written 6.6 years ago by biorepine1.4k
8
gravatar for Aaronquinlan
6.6 years ago by
Aaronquinlan11k
United States
Aaronquinlan11k wrote:

The bedtools map function does this. Note that it requires both files to be position-sorted (e.g., sort -k1,1 -k2,2n). To calculate the mean of the 4th (score) column of the file2, you'd use:

$ bedtools map -a file1.sorted.bed -b file2.sorted.bedg -c 4 -o mean
chr1    10    20    name    0    +    1.5

There are other operations besides mean (snippet from bedtools map -h):

-o    Specify the operation that should be applied to -c.
     Valid operations:
        sum, min, max,
        mean, median,
        collapse (i.e., print a comma separated list (duplicates allowed)), 
        distinct (i.e., print a comma separated list (NO duplicates allowed)), 
        count
        count_distinct (i.e., a count of the unique values in the column), 
    - Default: sum
ADD COMMENTlink modified 6.6 years ago • written 6.6 years ago by Aaronquinlan11k
1

A related question: in the example above, the value reported by bedtools map is not actually the mean across the region interrogated. In other words, it doesn't take into account the interval from the bedgraph file, while I would like to get the mean per nucleotide. Here, from position 10 to 19 on chr1, the mean per nucleotide should be (2X2.5 + 1X1.5 + 7X0)/10 = 0.65 Is there a way to calculate mean this way using bedtools map? Thanks!

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by julien.roux100
1

Use bedmap - see Alex's post below - but with the --wmean operator. That will weight each score in the bedgraph file by the length of the interval.

ADD REPLYlink written 23 months ago by josh10

+1 Even though I have answered your question I would go with bedtools! I didn't realise it had this functionality.

ADD REPLYlink written 6.6 years ago by Ian5.5k

Is it possible to print the output to a file? Standard shell scripting commands returns an empty file.

ADD REPLYlink written 4.2 years ago by Sandeep0
5
gravatar for Alex Reynolds
6.6 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

Another way to do this is with BEDOPS bedmap, but first, we need to add IDs to file2 to make it a "true" BED file — that is, scores/signals should be in the fifth column, as per UCSC specification.

One way to do this is with awk, for example:

$ awk 'BEGIN{id=0} {print $1"\t"$2"\t"$3"\tid-"id"\t"$4; id++;}' file2 > file2_with_ids.bed

Also, make sure your inputs are sorted, e.g. with the sort-bed utility:

$ sort-bed file1 > file1_sorted.bed
$ sort-bed file2_with_ids.bed > file2_with_ids_sorted.bed

Then we apply bedmap with the --mean operator:

$ bedmap --echo --mean --delim "\t" --prec 1 file1_sorted.bed file2_with_ids_sorted.bed > answer.bed

We use the --echo operator to first print the reference element from file1_sorted.bed, adding --mean to print the mean score of mapped elements from file2_with_ids_sorted.bed, which overlap the reference element from file1.bed by one or more bases.

We then use the --delim "\t" setting so that the delimiter between reference element and mean score is a tab character. The --prec 1 option specifies the precision of the mean signal to one decimal place.

Results are sent to the file answer.bed, which is (thanks to the tab delimiter) just another BED file:

$ more answer.bed
chr1    10  20  name    0   +   1.5

Other numerical and ID-based operators are available to bedmap, e.g. --median, --stdev, etc. as well as custom overlap criteria. Refer to the documentation for more details and sample calculations.

ADD COMMENTlink modified 6.4 years ago • written 6.6 years ago by Alex Reynolds28k
1

Thank you very much, i have to say this is the best way of doing it, I have used it for annotate regions of Chip-seq with PhasConsScores, this method its the best for building score annotations. Just want to say it if someone else need it later.

ADD REPLYlink written 19 months ago by carlosalfonsogonzalez610
1
gravatar for Ian
6.6 years ago by
Ian5.5k
University of Manchester, UK
Ian5.5k wrote:

1) convert file2 into bigWig format, using 'wigToBigWig'.

wigToBigWig file2 chrom.sizes file2.bw

2) use 'bigWigSummary' file1 and file2.bw

bigWigSummary -type=mean file2.bw chr1 10 20 1 > output

Programs can be obtained from HERE.

You will need to write a script to automatically read multiple entries from file1 into 'bigWigSummary'.

ADD COMMENTlink modified 6.6 years ago • written 6.6 years ago by Ian5.5k
0
gravatar for brianpenghe
5 months ago by
brianpenghe20
United States
brianpenghe20 wrote:

There are so many issues in the methods suggested above. A. different regions have different lengths. B. you don't know whether the "score" of inputs are already average scores of the region or the sum C. empty regions handling

I'd suggest converting the bedgraph files into bigWigs and then using DeepTools' multibigwigsummary: https://deeptools.readthedocs.io/en/latest/content/tools/multiBigwigSummary.html#BED-file

ADD COMMENTlink modified 5 months ago • written 5 months ago by brianpenghe20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 729 users visited in the last hour