False output when intersecting methylation data
4
0
Entering edit mode
4.8 years ago
theodore ▴ 90

Dear all, I came across the following situation: I have three WGBS methylation data from the epigenomics roadmap:

E016_WGBS_FractionalMethylation_allCHROM.bedgraph E013_WGBS_FractionalMethylation_allCHROM.bedgraph E004_WGBS_FractionalMethylation_allCHROM.bedgraph

The values of methylation states on specific sites are:

grep "chrY 5926917" E016_WGBS_FractionalMethylation_allCHROM.bedgraph
chrY    59269171    59269175    0.98 grep "chrY 5926917"
E004_WGBS_FractionalMethylation_allCHROM.bedgraph
chrY    59269171    59269173    0.93 chrY   59269173    59269175    1 grep "chrY
5926917"  E013_WGBS_FractionalMethylation_allCHROM.bedgraph
chrY    59269171    59269173    0.91 chrY   59269173    59269175    0.9


Notice that the site E016 state has a uniformed methylation state so the nucleotides at chrY 59269171 59269173 and chrY 59269173 59269175 are merged into one site which is not true for the rest of the sites. So if I perform multiIntersect:

grep 5926917 E016_E004_E013_WGBS_FractionalMethylation_allCHROM2_multiIntersect.bedgraph
chrY    59269171    59269175    3   1,2,3   1   1   1
chrY    59269175    59269209    0   none    0   0   0


All sites are merged into one If I do subsequent intersects

grep 59269173  E016_E004_E013_WGBS_FractionalMethylation_allCHROM2.bedgraph
chrY    59269171    59269173    0.98    chrY    59269171    59269175    0.93    chrY    59269171    59269173    0.91    2   2


chrY 59269171 59269173 0.98 chrY 59269171 59269175 1 chrY 59269173 59269175 0.9 2 2 chrY 59269173 59269175 0.98 chrY 59269171 59269175 0.93 chrY 59269171 59269173 0.91 2 2 chrY 59269173 59269175 0.98 chrY 59269171 59269175 1 chrY 59269173 59269175 0.9 2 2

The multiintersect does not include the appropriate methylation states (not its function) The subsequent IntersectBed keep the full file but somehow assign the values wrongly, what I would have expect would be:

chrY    59269171    59269173    0.98    chrY    59269171    59269175    0.93    chrY    59269171    59269173    0.91    2   2
chrY    59269173    59269175    0.98    chrY    59269171    59269175    1   chrY    59269173    59269175    0.9 2   2


Is there a way to get the correct intersect output? Thank you.

intersect multiIntersect WGBS • 1.4k views
1
Entering edit mode
4.8 years ago

There is a patch in BEDOPS 2.4.29 that fixes this issue, so that we get items in correct order.

Given three BED5 files, A.bed, B.bed, and C.bed (your epigenome samples, say, preformatted with awk with unique, lexicographically-ordered IDs in the ID field column):

$more A.bed chr1 10468 10470 setA 0.97 chr1 10470 10472 setA 1 chr1 10483 10485 setA 0.95$ more B.bed
chr1    10468   10470   setB    0.79
chr1    10470   10472   setB    0.93
chr1    10483   10485   setB    0.88

$more C.bed chr1 10468 10470 setC 0.91 chr1 10470 10472 setC 0.97 chr1 10483 10485 setC 0.87  We can make a partition and a union of these three sets: $ bedops --partition A.bed B.bed C.bed > partition.bed
$bedops --everything A.bed B.bed C.bed > union.bed  We can then map the scores from union.bed to the elements in partition.bed: $ bedmap --echo --echo-map-score --delim '\t' --multidelim '\t' --prec 2 partition.bed union.bed
chr1    10468   10470   0.97    0.79    0.91
chr1    10470   10472   1.00    0.93    0.97
chr1    10483   10485   0.95    0.88    0.87


However, I don't know when we will put out this revision; it may be a week or two, perhaps longer, as there some other issues I need to resolve with this release that will take time to fix.

You can, if you like, check out the current development version of BEDOPS via:

$git clone https://github.com/bedops/bedops.git$ cd bedops
$git checkout v2p4p29$ make
$make install  Then: $ ./bin/bedops --partition ...
$./bin/bedmap --echo --echo-map-score ...  etc. I'm sorry I don't have a production version ready for you to use, but if this is critical work for you, then the above should hopefully get you going. ADD COMMENT 0 Entering edit mode I am currently installing the git version. Will let you know if everything works as promised ;) Thank you for your time. ADD REPLY 0 Entering edit mode Just a quick note that you have bedgraph files, and you'd need to convert them to BED files with a unique ID that puts the files in desired order (setA == E016, setB == E004, etc.). ADD REPLY 0 Entering edit mode Hey Alex, Now I found that I get some small problems like: The script does not calculate for empty genomic locations as 0 or NA data and when it merges the fifht column it swifts the values so that they will fill the space. head -n 20 /media/trotos/4TB/hESC/Encode_chip_files/methylation/whatTOdo chr1 10468 10470 0.97 0.79 0.91 chr1 10470 10472 1.00 0.93 0.97 chr1 10483 10485 0.95 0.88 0.87 chr1 10488 10490 0.95 0.91 0.94 chr1 10492 10494 0.94 0.91 0.97 chr1 10496 10498 0.91 0.90 0.96 chr1 10524 10526 1.00 0.95 0.99 chr1 10541 10543 0.92 0.98 0.95 chr1 10562 10564 0.96 0.96 0.93 chr1 10570 10572 0.98 0.94 0.91 chr1 10576 10578 0.90 0.91 0.89 chr1 10578 10580 0.92 1.00 0.85 chr1 10588 10590 0.85 0.90 0.88 chr1 10608 10610 0.77 1.00 0.76 chr1 10616 10618 0.67 0.70 chr1 10619 10621 0.55 0.50 chr1 10630 10632 0.25 0.75 chr1 10632 10634 0.10 0.09 chr1 10635 10637 0.38 0.21 chr1 10637 10639 0.21 0.43  And to make things worse: grep "chr1 2956412" /media/trotos/4TB/hESC/Encode_chip_files/methylation/final chr1 29564121 29564123 0.00 0.00 0.02 chr1 29564124 29564126 0.00 0.00 chr1 29564127 29564129 0.00 0.00  And the original files grep "chr1 2956412" /media/trotos/4TB/hESC/Encode_chip_files/methylation/E016_WGBS_FractionalMethylation_allCHROM.bedgraph chr1 29564121 29564123 0 chr1 29564124 29564126 0 chr1 29564127 29564129 0 grep "chr1 2956412" /media/trotos/4TB/hESC/Encode_chip_files/methylation/E004_WGBS_FractionalMethylation_allCHROM.bedgraph grep "chr1 2956412" /media/trotos/4TB/hESC/Encode_chip_files/methylation/E013_WGBS_FractionalMethylation_allCHROM.bedgraph chr1 29564121 29564123 0.02 chr1 29564124 29564126 0 chr1 29564127 29564129 0  ADD REPLY 0 Entering edit mode How can I compensate for the empty areas? ADD REPLY 0 Entering edit mode Please see answer below, for how to fill in gaps. ADD REPLY 1 Entering edit mode 4.8 years ago If you have "holes" in your sets, you need to patch them. Say we have modified our initial three sets: $ more A.bed
chr1    10468   10470   setA    0.97
chr1    10470   10472   setA    1
chr1    10483   10485   setA    0.95

$more B.bed chr1 10468 10470 setB 0.79 chr1 10483 10485 setB 0.88$ more C.bed
chr1    10468   10470   setC    0.91
chr1    10470   10472   setC    0.97
chr1    10483   10485   setC    0.87


Note here that B.bed is missing one of the three elements.

As you note, if we bedmap the partition against the union, we'll get gaps.

Make the partition set as before:

$bedops --partition A.bed B.bed C.bed > partition.bed  For each input set, bedmap that set against the partition, using the --indicator function to mark where there are and are not overlaps. Use this value with awk to post-process output: $ bedmap --indicator --echo --echo-map --delim '\t' partition.bed A.bed | awk -v label='setA' -v OFS='\t' '{ if ($1 == 1) { print$5, $6,$7, $8,$9 } else { print $2,$3, $4, label, "NAN" } }' > A.patched.bed$ bedmap --indicator --echo --echo-map --delim '\t' partition.bed B.bed | awk -v label='setB' -v OFS='\t' '{ if ($1 == 1) { print$5, $6,$7, $8,$9 } else { print $2,$3, $4, label, "NAN" } }' > B.patched.bed$ bedmap --indicator --echo --echo-map --delim '\t' partition.bed C.bed | awk -v label='setC' -v OFS='\t' '{ if ($1 == 1) { print$5, $6,$7, $8,$9 } else { print $2,$3, $4, label, "NAN" } }' > C.patched.bed  (This is easy to script for N arbitrary sets. I'm writing everything out here explicitly so you see how this works.) You could use 0.0 instead of NAN, depending on your experiment and what these values mean in that context. The NAN value is a reserved IEEE 754 floating point value, so it can be used here when it is meaningful to use that value as a marker. Incidentally, this is why bedmap returns nothing when there are no overlaps, as 0.0 and NAN are values that have very specific and different meanings, and it is not really possible for bedmap to guess correctly which value is appropriate to use when the mapping result is a proper empty set. In any case, here is what our patched sets look like: $ more A.patched.bed
chr1    10468   10470   setA    0.97
chr1    10470   10472   setA    1
chr1    10483   10485   setA    0.95

$more B.patched.bed chr1 10468 10470 setB 0.79 chr1 10470 10472 setB NAN chr1 10483 10485 setB 0.88$ more C.patched.bed
chr1    10468   10470   setC    0.91
chr1    10470   10472   setC    0.97
chr1    10483   10485   setC    0.87


From here out, it's the same as before. Take the union of the patched sets:

$bedops --everything *.patched.bed > union.bed  Then map as before: $ bedmap --echo --echo-map-score --delim '\t' --multidelim '\t' --prec 2 partition.bed union.bed
chr1    10468   10470   0.97    0.79    0.91
chr1    10470   10472   1.00    nan     0.97
chr1    10483   10485   0.95    0.88    0.87


(Note: This requires bedmap v2.4.29 or greater.)

0
Entering edit mode

I caved and we put out 2.4.29 earlier than expected. Hopefully this should trickle down to package managers soon, depending on people's schedules, but we have compiled binaries just in case. http://bedops.readthedocs.io/en/latest/content/revision-history.html

0
Entering edit mode

Hi Alex, Thank you for your time. As far as I can see it works well.

0
Entering edit mode
4.8 years ago

You could use BEDOPS bedops --partition to partition the genomic space and bedmap to map the unioned set against the partitioned space.

Abstractly:

$bedops --partition A.bed B.bed C.bed > partitionedABC.bed$ bedops --everything A.bed B.bed C.bed > unionABC.bed
$bedmap --echo --echo-map --delim '\t' partitionedABC.bed unionABC.bed > answer.bed  You could use awk to re-columnize answer.bed to whatever special format you're aiming at. Or, as a one-liner, without intermediate files to slow things down: $ bedmap --echo --echo-map --delim '\t' <(bedops --partition A.bed B.bed C.bed) <(bedops --everything A.bed B.bed C.bed) > answer.bed

0
Entering edit mode

Thank you Alex for the answer. Since I am not familiar with the bedops suit could you please elaborate on the need for performing 3 steps and latter an additional awk? I checked briefly the functions and it is not intuitive hat this would be the best way to get the results.

0
Entering edit mode

Your question seems to suggest you want to partition the input space and then map all your elements to the partitioned space. It might help to read the documentation for bedops and particularly the --partition operator. Then read the portion of the bedmap documentation that relates to --echo-map-* operators.

0
Entering edit mode

Ok, I followed your advice and although the binning is correct I found that the data is somehow scrambled. so, the first lines of the original files are: (NOTE THAT I FOLLOW THE SAME PATTERN FOR THE FILE'S ORDER E016-E004-E013)

trotos@StupidMint:~$head /media/trotos/4TB/hESC/Encode_chip_files/methylation/E016_WGBS_FractionalMethylation_allCHROM.bedgraph chr1 10468 10470 0.97 chr1 10470 10472 1 chr1 10483 10485 0.95 chr1 10488 10490 0.95 chr1 10492 10494 0.94 chr1 10496 10498 0.91 chr1 10524 10526 1 chr1 10541 10543 0.92 chr1 10562 10564 0.96 chr1 10570 10572 0.98 --- 19:13:12 trotos@StupidMint:~$ head /media/trotos/4TB/hESC/Encode_chip_files/methylation/E004_WGBS_FractionalMethylation_allCHROM.bedgraph
chr1    10468   10470   0.79
chr1    10470   10472   0.93
chr1    10483   10485   0.88
chr1    10488   10490   0.91
chr1    10492   10494   0.91
chr1    10496   10498   0.9
chr1    10524   10526   0.95
chr1    10541   10543   0.98
chr1    10562   10564   0.96
chr1    10570   10572   0.94
---  19:13:26
trotos@StupidMint:~$head /media/trotos/4TB/hESC/Encode_chip_files/methylation/E013_WGBS_FractionalMethylation_allCHROM.bedgraph chr1 10468 10470 0.91 chr1 10470 10472 0.97 chr1 10483 10485 0.87 chr1 10488 10490 0.94 chr1 10492 10494 0.97 chr1 10496 10498 0.96 chr1 10524 10526 0.99 chr1 10541 10543 0.95 chr1 10562 10564 0.93 chr1 10570 10572 0.91  and after your pipeline: ./bedops --partition E016_WGBS_FractionalMethylation_allCHROM.bedgraph E004_WGBS_FractionalMethylation_allCHROM.bedgraph E013_WGBS_FractionalMethylation_allCHROM.bedgraph > /partitionedBEDOPS ./bedops --everything E016_WGBS_FractionalMethylation_allCHROM.bedgraph E004_WGBS_FractionalMethylation_allCHROM.bedgraph E013_WGBS_FractionalMethylation_allCHROM.bedgraph > everything ./bedmap --echo --echo-map --delim '\t' partitionedBEDOPS everything > whatTOdo  with some sed-awk to clear the output: sed 's/\;/\t/g' /media/trotos/4TB/hESC/Encode_chip_files/methylation/whatTOdo | awk '{print$1"\t"$2"\t"$3"\t"$7"\t"$11"\t"\$15}' | head
chr1    10468   10470   0.79    0.91    0.97
chr1    10470   10472   1   0.93    0.97
chr1    10483   10485   0.87    0.88    0.95
chr1    10488   10490   0.95    0.91    0.94
chr1    10492   10494   0.91    0.94    0.97
chr1    10496   10498   0.96    0.9 0.91
chr1    10524   10526   0.95    0.99    1
chr1    10541   10543   0.98    0.92    0.95
chr1    10562   10564   0.93    0.96    0.96
chr1    10570   10572   0.98    0.91    0.94


As you can see the values are completely messed up. The first line should have been: chr1 10468 10470 0.97 0.79 0.91 the second line does not even follow the order of the first line.

0
Entering edit mode

I think there's a way to do this that is not pretty, but prepends the ID field in a way that provides results to resort lexicographically, so that map results can be printed in a consistent order. I'll check to see if there is a smarter way to do this and, either way, I'll probably follow up with an answer tomorrow or Tuesday.

0
Entering edit mode
4.8 years ago
theodore ▴ 90

I will wait patiently. Thank you for your efforts.