Question: False output when intersecting methylation data
0
gravatar for theodore
23 months ago by
theodore40
Germany
theodore40 wrote:

Dear all, I came across the following situation: I have three WGBS methylation data from the epigenomics roadmap:

E016_WGBS_FractionalMethylation_allCHROM.bedgraph E013_WGBS_FractionalMethylation_allCHROM.bedgraph E004_WGBS_FractionalMethylation_allCHROM.bedgraph

The values of methylation states on specific sites are:

grep "chrY 5926917" E016_WGBS_FractionalMethylation_allCHROM.bedgraph
    chrY    59269171    59269175    0.98 grep "chrY 5926917"
E004_WGBS_FractionalMethylation_allCHROM.bedgraph
    chrY    59269171    59269173    0.93 chrY   59269173    59269175    1 grep "chrY
5926917"  E013_WGBS_FractionalMethylation_allCHROM.bedgraph
    chrY    59269171    59269173    0.91 chrY   59269173    59269175    0.9

Notice that the site E016 state has a uniformed methylation state so the nucleotides at chrY 59269171 59269173 and chrY 59269173 59269175 are merged into one site which is not true for the rest of the sites. So if I perform multiIntersect:

grep 5926917 E016_E004_E013_WGBS_FractionalMethylation_allCHROM2_multiIntersect.bedgraph
    chrY    59269171    59269175    3   1,2,3   1   1   1
    chrY    59269175    59269209    0   none    0   0   0

All sites are merged into one If I do subsequent intersects

grep 59269173  E016_E004_E013_WGBS_FractionalMethylation_allCHROM2.bedgraph
    chrY    59269171    59269173    0.98    chrY    59269171    59269175    0.93    chrY    59269171    59269173    0.91    2   2

chrY 59269171 59269173 0.98 chrY 59269171 59269175 1 chrY 59269173 59269175 0.9 2 2 chrY 59269173 59269175 0.98 chrY 59269171 59269175 0.93 chrY 59269171 59269173 0.91 2 2 chrY 59269173 59269175 0.98 chrY 59269171 59269175 1 chrY 59269173 59269175 0.9 2 2

The multiintersect does not include the appropriate methylation states (not its function) The subsequent IntersectBed keep the full file but somehow assign the values wrongly, what I would have expect would be:

chrY    59269171    59269173    0.98    chrY    59269171    59269175    0.93    chrY    59269171    59269173    0.91    2   2
chrY    59269173    59269175    0.98    chrY    59269171    59269175    1   chrY    59269173    59269175    0.9 2   2

Is there a way to get the correct intersect output? Thank you.

wgbs intersect multiintersect • 719 views
ADD COMMENTlink modified 23 months ago by Alex Reynolds28k • written 23 months ago by theodore40
1
gravatar for Alex Reynolds
23 months ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

There is a patch in BEDOPS 2.4.29 that fixes this issue, so that we get items in correct order.

Given three BED5 files, A.bed, B.bed, and C.bed (your epigenome samples, say, preformatted with awk with unique, lexicographically-ordered IDs in the ID field column):

$ more A.bed
chr1    10468   10470   setA    0.97
chr1    10470   10472   setA    1
chr1    10483   10485   setA    0.95

$ more B.bed
chr1    10468   10470   setB    0.79
chr1    10470   10472   setB    0.93
chr1    10483   10485   setB    0.88

$ more C.bed
chr1    10468   10470   setC    0.91
chr1    10470   10472   setC    0.97
chr1    10483   10485   setC    0.87

We can make a partition and a union of these three sets:

$ bedops --partition A.bed B.bed C.bed > partition.bed
$ bedops --everything A.bed B.bed C.bed > union.bed

We can then map the scores from union.bed to the elements in partition.bed:

$ bedmap --echo --echo-map-score --delim '\t' --multidelim '\t' --prec 2 partition.bed union.bed
chr1    10468   10470   0.97    0.79    0.91
chr1    10470   10472   1.00    0.93    0.97    
chr1    10483   10485   0.95    0.88    0.87

However, I don't know when we will put out this revision; it may be a week or two, perhaps longer, as there some other issues I need to resolve with this release that will take time to fix.

You can, if you like, check out the current development version of BEDOPS via:

$ git clone https://github.com/bedops/bedops.git
$ cd bedops
$ git checkout v2p4p29
$ make
$ make install

Then:

$ ./bin/bedops --partition ...
$ ./bin/bedmap --echo --echo-map-score ...

etc.

I'm sorry I don't have a production version ready for you to use, but if this is critical work for you, then the above should hopefully get you going.

ADD COMMENTlink modified 23 months ago • written 23 months ago by Alex Reynolds28k

I am currently installing the git version. Will let you know if everything works as promised ;) Thank you for your time.

ADD REPLYlink written 23 months ago by theodore40

Just a quick note that you have bedgraph files, and you'd need to convert them to BED files with a unique ID that puts the files in desired order (setA == E016, setB == E004, etc.).

ADD REPLYlink written 23 months ago by Alex Reynolds28k

Hey Alex, Now I found that I get some small problems like: The script does not calculate for empty genomic locations as 0 or NA data and when it merges the fifht column it swifts the values so that they will fill the space.

head -n 20 /media/trotos/4TB/hESC/Encode_chip_files/methylation/whatTOdo
chr1    10468   10470   0.97    0.79    0.91
chr1    10470   10472   1.00    0.93    0.97
chr1    10483   10485   0.95    0.88    0.87
chr1    10488   10490   0.95    0.91    0.94
chr1    10492   10494   0.94    0.91    0.97
chr1    10496   10498   0.91    0.90    0.96
chr1    10524   10526   1.00    0.95    0.99
chr1    10541   10543   0.92    0.98    0.95
chr1    10562   10564   0.96    0.96    0.93
chr1    10570   10572   0.98    0.94    0.91
chr1    10576   10578   0.90    0.91    0.89
chr1    10578   10580   0.92    1.00    0.85
chr1    10588   10590   0.85    0.90    0.88
chr1    10608   10610   0.77    1.00    0.76
chr1    10616   10618   0.67    0.70
chr1    10619   10621   0.55    0.50
chr1    10630   10632   0.25    0.75
chr1    10632   10634   0.10    0.09
chr1    10635   10637   0.38    0.21
chr1    10637   10639   0.21    0.43

And to make things worse:

grep "chr1   2956412" /media/trotos/4TB/hESC/Encode_chip_files/methylation/final
chr1    29564121    29564123    0.00    0.00    0.02
chr1    29564124    29564126    0.00    0.00
chr1    29564127    29564129    0.00    0.00

And the original files

grep "chr1   2956412" /media/trotos/4TB/hESC/Encode_chip_files/methylation/E016_WGBS_FractionalMethylation_allCHROM.bedgraph
chr1    29564121    29564123    0
chr1    29564124    29564126    0
chr1    29564127    29564129    0

grep "chr1   2956412" /media/trotos/4TB/hESC/Encode_chip_files/methylation/E004_WGBS_FractionalMethylation_allCHROM.bedgraph

grep "chr1   2956412" /media/trotos/4TB/hESC/Encode_chip_files/methylation/E013_WGBS_FractionalMethylation_allCHROM.bedgraph
    chr1    29564121    29564123    0.02
    chr1    29564124    29564126    0
    chr1    29564127    29564129    0
ADD REPLYlink modified 23 months ago • written 23 months ago by theodore40

How can I compensate for the empty areas?

ADD REPLYlink written 23 months ago by theodore40

Please see answer below, for how to fill in gaps.

ADD REPLYlink written 23 months ago by Alex Reynolds28k
1
gravatar for Alex Reynolds
23 months ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

If you have "holes" in your sets, you need to patch them.

Say we have modified our initial three sets:

$ more A.bed
chr1    10468   10470   setA    0.97
chr1    10470   10472   setA    1
chr1    10483   10485   setA    0.95

$ more B.bed
chr1    10468   10470   setB    0.79
chr1    10483   10485   setB    0.88

$ more C.bed
chr1    10468   10470   setC    0.91
chr1    10470   10472   setC    0.97
chr1    10483   10485   setC    0.87

Note here that B.bed is missing one of the three elements.

As you note, if we bedmap the partition against the union, we'll get gaps.

Make the partition set as before:

$ bedops --partition A.bed B.bed C.bed > partition.bed

For each input set, bedmap that set against the partition, using the --indicator function to mark where there are and are not overlaps. Use this value with awk to post-process output:

$ bedmap --indicator --echo --echo-map --delim '\t' partition.bed A.bed | awk -v label='setA' -v OFS='\t' '{ if ($1 == 1) { print $5, $6, $7, $8, $9 } else { print $2, $3, $4, label, "NAN" } }' > A.patched.bed

$ bedmap --indicator --echo --echo-map --delim '\t' partition.bed B.bed | awk -v label='setB' -v OFS='\t' '{ if ($1 == 1) { print $5, $6, $7, $8, $9 } else { print $2, $3, $4, label, "NAN" } }' > B.patched.bed

$ bedmap --indicator --echo --echo-map --delim '\t' partition.bed C.bed | awk -v label='setC' -v OFS='\t' '{ if ($1 == 1) { print $5, $6, $7, $8, $9 } else { print $2, $3, $4, label, "NAN" } }' > C.patched.bed

(This is easy to script for N arbitrary sets. I'm writing everything out here explicitly so you see how this works.)

You could use 0.0 instead of NAN, depending on your experiment and what these values mean in that context. The NAN value is a reserved IEEE 754 floating point value, so it can be used here when it is meaningful to use that value as a marker.

Incidentally, this is why bedmap returns nothing when there are no overlaps, as 0.0 and NAN are values that have very specific and different meanings, and it is not really possible for bedmap to guess correctly which value is appropriate to use when the mapping result is a proper empty set.

In any case, here is what our patched sets look like:

$ more A.patched.bed
chr1    10468   10470   setA    0.97
chr1    10470   10472   setA    1
chr1    10483   10485   setA    0.95

$ more B.patched.bed
chr1    10468   10470   setB    0.79
chr1    10470   10472   setB    NAN
chr1    10483   10485   setB    0.88

$ more C.patched.bed
chr1    10468   10470   setC    0.91
chr1    10470   10472   setC    0.97
chr1    10483   10485   setC    0.87

From here out, it's the same as before. Take the union of the patched sets:

$ bedops --everything *.patched.bed > union.bed

Then map as before:

$ bedmap --echo --echo-map-score --delim '\t' --multidelim '\t' --prec 2 partition.bed union.bed
chr1    10468   10470   0.97    0.79    0.91
chr1    10470   10472   1.00    nan     0.97
chr1    10483   10485   0.95    0.88    0.87

(Note: This requires bedmap v2.4.29 or greater.)

ADD COMMENTlink modified 23 months ago • written 23 months ago by Alex Reynolds28k

I caved and we put out 2.4.29 earlier than expected. Hopefully this should trickle down to package managers soon, depending on people's schedules, but we have compiled binaries just in case. http://bedops.readthedocs.io/en/latest/content/revision-history.html

ADD REPLYlink written 23 months ago by Alex Reynolds28k

Hi Alex, Thank you for your time. As far as I can see it works well.

ADD REPLYlink modified 22 months ago • written 22 months ago by theodore40
0
gravatar for Alex Reynolds
23 months ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

You could use BEDOPS bedops --partition to partition the genomic space and bedmap to map the unioned set against the partitioned space.

Abstractly:

$ bedops --partition A.bed B.bed C.bed > partitionedABC.bed
$ bedops --everything A.bed B.bed C.bed > unionABC.bed
$ bedmap --echo --echo-map --delim '\t' partitionedABC.bed unionABC.bed > answer.bed

You could use awk to re-columnize answer.bed to whatever special format you're aiming at.

Or, as a one-liner, without intermediate files to slow things down:

$ bedmap --echo --echo-map --delim '\t' <(bedops --partition A.bed B.bed C.bed) <(bedops --everything A.bed B.bed C.bed) > answer.bed
ADD COMMENTlink written 23 months ago by Alex Reynolds28k

Thank you Alex for the answer. Since I am not familiar with the bedops suit could you please elaborate on the need for performing 3 steps and latter an additional awk? I checked briefly the functions and it is not intuitive hat this would be the best way to get the results.

ADD REPLYlink written 23 months ago by theodore40

Your question seems to suggest you want to partition the input space and then map all your elements to the partitioned space. It might help to read the documentation for bedops and particularly the --partition operator. Then read the portion of the bedmap documentation that relates to --echo-map-* operators.

ADD REPLYlink written 23 months ago by Alex Reynolds28k

Ok, I followed your advice and although the binning is correct I found that the data is somehow scrambled. so, the first lines of the original files are: (NOTE THAT I FOLLOW THE SAME PATTERN FOR THE FILE'S ORDER E016-E004-E013)

trotos@StupidMint:~$ head /media/trotos/4TB/hESC/Encode_chip_files/methylation/E016_WGBS_FractionalMethylation_allCHROM.bedgraph
chr1    10468   10470   0.97
chr1    10470   10472   1
chr1    10483   10485   0.95
chr1    10488   10490   0.95
chr1    10492   10494   0.94
chr1    10496   10498   0.91
chr1    10524   10526   1
chr1    10541   10543   0.92
chr1    10562   10564   0.96
chr1    10570   10572   0.98
---  19:13:12
trotos@StupidMint:~$ head /media/trotos/4TB/hESC/Encode_chip_files/methylation/E004_WGBS_FractionalMethylation_allCHROM.bedgraph
chr1    10468   10470   0.79
chr1    10470   10472   0.93
chr1    10483   10485   0.88
chr1    10488   10490   0.91
chr1    10492   10494   0.91
chr1    10496   10498   0.9
chr1    10524   10526   0.95
chr1    10541   10543   0.98
chr1    10562   10564   0.96
chr1    10570   10572   0.94
---  19:13:26
trotos@StupidMint:~$ head /media/trotos/4TB/hESC/Encode_chip_files/methylation/E013_WGBS_FractionalMethylation_allCHROM.bedgraph
chr1    10468   10470   0.91
chr1    10470   10472   0.97
chr1    10483   10485   0.87
chr1    10488   10490   0.94
chr1    10492   10494   0.97
chr1    10496   10498   0.96
chr1    10524   10526   0.99
chr1    10541   10543   0.95
chr1    10562   10564   0.93
chr1    10570   10572   0.91

and after your pipeline:

./bedops --partition E016_WGBS_FractionalMethylation_allCHROM.bedgraph E004_WGBS_FractionalMethylation_allCHROM.bedgraph E013_WGBS_FractionalMethylation_allCHROM.bedgraph > /partitionedBEDOPS
./bedops --everything E016_WGBS_FractionalMethylation_allCHROM.bedgraph E004_WGBS_FractionalMethylation_allCHROM.bedgraph E013_WGBS_FractionalMethylation_allCHROM.bedgraph > everything
./bedmap --echo --echo-map --delim '\t' partitionedBEDOPS everything > whatTOdo

with some sed-awk to clear the output:

sed 's/\;/\t/g' /media/trotos/4TB/hESC/Encode_chip_files/methylation/whatTOdo | awk '{print$1"\t"$2"\t"$3"\t"$7"\t"$11"\t"$15}' | head
chr1    10468   10470   0.79    0.91    0.97
chr1    10470   10472   1   0.93    0.97
chr1    10483   10485   0.87    0.88    0.95
chr1    10488   10490   0.95    0.91    0.94
chr1    10492   10494   0.91    0.94    0.97
chr1    10496   10498   0.96    0.9 0.91
chr1    10524   10526   0.95    0.99    1
chr1    10541   10543   0.98    0.92    0.95
chr1    10562   10564   0.93    0.96    0.96
chr1    10570   10572   0.98    0.91    0.94

As you can see the values are completely messed up. The first line should have been: chr1 10468 10470 0.97 0.79 0.91 the second line does not even follow the order of the first line.

ADD REPLYlink modified 23 months ago • written 23 months ago by theodore40

I think there's a way to do this that is not pretty, but prepends the ID field in a way that provides results to resort lexicographically, so that map results can be printed in a consistent order. I'll check to see if there is a smarter way to do this and, either way, I'll probably follow up with an answer tomorrow or Tuesday.

ADD REPLYlink modified 23 months ago • written 23 months ago by Alex Reynolds28k
0
gravatar for theodore
23 months ago by
theodore40
Germany
theodore40 wrote:

I will wait patiently. Thank you for your efforts.

ADD COMMENTlink written 23 months ago by theodore40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1077 users visited in the last hour