Question: merge sites of BEDgraph file and calculate coverage weighted average of collapsed scores
0
4.7 years ago by
United States
reilly.brian.m60 wrote:

EDIT: I wrote an R script to do what I want, but I feel like there must be a better/faster way.  I added the skeleton/logic of the script is at the bottom of the post, so if you have any ideas on how to make it better, or alternate methods, please let me know.

I have a BEDgraph type data.table with current coordinates of length = 1.  I want to collapse coordinates that are located within 500 bp of one another on the same chromosome IF the the sign of the contents of the "meth.diff" column are equal.   Additionally I would like to calculate the coverage weighted average of the mean_KO/mean_WT columns with the weighting given in the coverage_KO/coverage_WT columns for rows that are collapsed.

Here is an example of my input data:

``````library(data.table)
dt = structure(list(chr = c("chr1", "chr1", "chr1", "chr1", "chrX",
"chrX", "chrX", "chrX"), start = c(842326, 855423, 855426, 855739,
153880833, 153880841, 154298086, 154298089), end = c(842327L,
855424L, 855427L, 855740L, 153880834L, 153880842L, 154298087L,
154298090L), meth.diff = c(9.35200555410902, 19.1839617944039,
29.6734426495636, -12.3375577709254, 50.5830043986142, 52.7503561092491,
46.5783738475184, 41.8662800742733), mean_KO = c(9.35200555410902,
19.1839617944039, 32.962962583692, 1.8512250859083, 51.2741224212646,
53.0928367727283, 47.4901932463221, 44.8441659366298), mean_WT = c(0,
0, 3.28951993412841, 14.1887828568337, 0.69111802265039, 0.34248066347919,
0.91181939880374, 2.97788586235646), coverage_KO = c(139L, 55L,
55L, 270L, 195L, 194L, 131L, 131L), coverage_WT = c(120L, 86L,
87L, 444L, 291L, 293L, 181L, 181L)), .Names = c("chr", "start",
"end", "meth.diff", "mean_KO", "mean_WT", "coverage_KO", "coverage_WT"
), class = c("data.table", "data.frame"), row.names = c(NA, -8L
))``````

What I would like the output to look like (except collapsed row means would be calculated and not in string form):

``````library(data.table)
dt_output = structure(list(chr = c("chr1", "chr1", "chr1", "chrX", "chrX"
), start = c(842326, 855423, 855739, 153880833, 154298086), end = c(842327,
855427, 855740, 153880842, 154298090), mean_1 = c("9.35", "((19.18*55)/(55+55)) + ((32.96*55)/(55+55))",
"1.85", "((51.27*195)/(195+194)) + ((53.09*194)/(195+194))",
"((47.49*131)/(131+131)) + ((44.84*131)/(131+131))"), mean_2 = c("0",
"((0.00*86)/(86+87)) + ((3.29*87)/(86+87))", "14.19", "((0.69*291)/(291+293)) + ((0.34*293)/(291+293))",
"((0.91*181)/(181+181)) + ((2.98*181)/(181+181))")), .Names = c("chr",
"start", "end", "mean_1", "mean_2"), row.names = c(NA, -5L), class = c("data.table", "data.frame"))``````

If my request is still unclear, here it is stated explicitly:

1st: I would like to collapse adjacent rows into larger start/end ranges based on the following criteria:

1. The two adjacent rows share the same value for the "chr" column (row 1 "chr" = chr1, and row 2 "chr" = chr1)
2. The two adjacent rows have "start" (or "end") coordinate within 500 of one another (if row 1 "start" = 1000, and row 2 "start" <= 1499, collapse these into a single row; if row1 = 1000 and row2 = 1500, keep separate)
3. The adjacent rows must have the same sign for the "diff" column (i.e. even if chr = chr and start within 500, if diff1 = + 5 and diff2 = -5, keep entries separate)

2nd: I would like to calculate the coverage_ weighted averages of the collapsed mean_KO/WT columns with the weighting by the coverage_KO/WT columns:

Ex: collapse 2 rows,

row 1 mean_WT = 5.0, coverage_WT = 20.

row 2 mean_WT =40.0, coverage_WT = 45.

weighted avg mean_WT = (((5.0*20)/(20+45)) + ((40.0*45)/(20+45))) = 29.23

Help with either part 1 or 2 or any advice is appreciated.

I have been using R for most of my data manipulations, but I am open to any language that can provide a solution. Thanks in advance.

EDIT: R script logic skeleton:

1. read file row by row (WHILE n <= length(file))
2. IF not on last row of file:
1. IF row n and n + 1 within 500bp  & meth.diff has same sign
1. --> merge sites into a region, calculate coverage weighted avg, save as a single row data.table, and n = n+2 (skip next row (n+1), it was merged)
2. ELSE:
1. --> save site as a single row data.table, do not calculate new mean, and n = n +1 (go to next line, it was not merged)
3. IF on last row of file:
1. IF row n,n -1 (within 500 & meth.diff has same sign=TRUE) AND row n-1,n-2 (within 500 & meth.diff same sign=TRUE)
1. --> save row as a single row data.table, and n = n+1 (this row was not merged because the two rows before were merged and 2 was added to n, end this round of program, this site will be merged with region below in next round.)
2. IF row n,n -1 (within 500 & meth.diff has same sign=TRUE) AND row n-1,n-2 (within 500 & meth.diff same sign=FALSE)
1. --> merge rows, save as a single row data.table, and n = n+1 ( sites n-1/n-2 weren't merged, and n/n-1 should be merged, so merge into a region, calculate coverage weighted avg)
3. IF row n,n -1 (within 500 & meth.diff has same sign=FALSE)
1. --> save row as a single row data.table, and n = n+1 (This site should not be merged with site below, so save as is)
4.  Concatenate the data.table rows created for merged/unmerged sites, and feed this new data.table back into the function until successive iterations produce an identical result.

data.table bed merge R bedtools • 2.3k views
modified 4.7 years ago by Alex Reynolds30k • written 4.7 years ago by reilly.brian.m60
2
4.7 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

Thanks Alex! This step-by-step was exactly what I needed to get it to work with BEDOPS. I should be able to implement this now.

1
4.7 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

You could do this on the command line with awk and BEDOPS bedmap.

First, preprocess your Bedgraph files by normalizing by the coverage. This can be done in R, since you have the coverage vector there. Preprocessing gets the scores ready for later use.

Then separate preprocessed score elements by their score sign and turn them into sorted BED files.

Say your preprocessed elements are in a file called `elements.bedgraph`:

``````\$ awk '\$4 < 0' elements.bedgraph | awk '{ print \$1"\t"\$2"\t"\$3"\t.\t"\$4; }' | sort-bed - > elements.neg.bed
\$ awk '\$4 >= 0' elements.bedgraph | awk '{ print \$1"\t"\$2"\t"\$3"\t.\t"\$4; }' | sort-bed - > elements.pos.bed
``````

Second, apply a bedmap operation on each signed subset. Pad elements by 500 bp with the option `--range 500`. Elements that are between 1 and 500 bases apart will associate. If you want the mean of the preprocessed score data over the mapped regions, add the `--mean` operator:

``````\$ bedmap --range 500 --echo --mean --delim '\t' elements.pos.bed > elements.pos.pad500ntMapAndMean.bed
``````

This shows each element of `elements.pos.bed` and, for each element, the mean of preprocessed scores of all other elements in this set that overlap within 500 bases.

Hey Alex, thanks for your help!

Your solution almost works for my situation. The issue here is that if I create two separate bed files, one with positive values and one with negative values, then when I merge sites within 500bp for the positive bedgraph file, I could be merging over a span where there was a site in the negative bedgraph file. I would like to only merge the sites that contiguously had the same sign in the score file.

For example, given the three sites below, I would like to not merge the two positive sites even though they are within 500bp, because there is a negative site in between them.

``````chr1 234 235 +34
chr1 239 240 -23
chr1 250 251 +26
``````

Do you see what I mean?

1

With some more set operations, I think you could get to your answer.

You could merge positive elements within 500 bp, and do the same with negative elements within 500 bp:

``````\$ bedmap --range 500 --echo-map --multidelim '\n' elements.pos.bed | sort-bed - > posMap.bed
\$ bedmap --range 500 --echo-map --multidelim '\n' elements.neg.bed | sort-bed - > negMap.bed
``````

We pipe the result to `sort-bed` in both cases, because in this special case, the mapped BED elements coming out of the bedmap step are not guaranteed to be in sorted order.

We have two sets called `posMap.bed` and `negMap.bed`.

Within one set, each element of that set contains elements that are within 500 bp of at least one other element in that set. This is the case of the `posMap` set and also of the `negMap` set.

Next, we can pad each element of `posMap.bed` by 500 bases, and filter them against the elements in the original negative set `elements.neg.bed` using bedops --not-element-of:

``````\$ bedops --everything --range 500 posMap.bed | bedops --not-element-of 1 - elements.neg.bed | bedops --everything --range -500 - > filtered.posMap.bed
``````

We can do the same for `negMap.bed`:

``````\$ bedops --everything --range 500 negMap.bed | bedops --not-element-of 1 - elements.pos.bed | bedops --everything --range -500 - > filtered.negMap.bed
``````

With this series of operations, a positive-filtered set should not overlap any elements in the negative-score set within 500 bases, and likewise for the negative-filtered set.

Once you have these filtered elements, you can re-map to get all (positive or negative) elements that map within 500 bases, which exclude (negative or positive, resp.) elements inside that 500 base window, and take the mean of their scores:

``````\$ bedmap --range 500 --echo --mean --delim '\t' filtered.posMap.bed > filtered.posMap.withMeanOfElementsWithin500Bases.bed
``````

And do the same for negative-filtered elements (if needed):

``````\$ bedmap --range 500 --echo --mean --delim '\t' filtered.negMap.bed > filtered.negMap.withMeanOfElementsWithin500Bases.bed
``````

I think this should work to exclude the elements in your scenario.

I am not really familiar with BEDOPS (other than hearing of it before), I usually use bedtools, which from what I can gather is somewhat similar. I will take a look at BEDOPS now to see if your strategy will work for me.

In the mean time let me just ask you this: Using the pipeline/strategy you laid out, do you think it would be feasible to calculate the means of merged sites, weighted by their unmerged value in the `coverage_` column? This is in regards to part 2 of my question. I would like to calculate the coverage weighted means of the merged areas.

I ended up writing an R script to do this for me in the mean time, but it is really ugly and I'm afraid it is going to be very slow, so I am interested in trying to get BEDOPS to work so I don't waste time when I process more files.

That was what I meant by preprocessing the scores, since you have weights for each region. You could do this in R, then `write.table()` to write out BED files. Once you have done the set operations, you are then calculating the mean of preprocessed (pre-weighted) scores. If you're more familiar with bedtools, you could use that instead, assuming it does the same set and map operations I outlined above. I'm not really familiar with it so I can't speak to its use here, but there are lots of bedtools people here, so I'm sure someone who knows could comment.

I am not exactly sure what you mean by "preprocess the scores"? I am trying to calculate the scores based on whether they merge or not, so how could I pre-calculate if I don't yet know what the overall merged region will contain?

For example, if there are 3 sites that are merged:

``````Coverage weighted mean = (val1 * coverage1 + val2* coverage2 + val3*coverage3) / (coverage1+2+3)
``````

I could calculate the weight for a site (i.e. `val1*coverage1`), but I must know how many and which sites will merge in order to calculate the denominator for the equation right? Or am I missing something simple here?

Thanks again for your help. By the way I've started looking into BEDOPS and now I definitely plan on using this for some of my downstream analyses, so thanks for pointing me in that direction.