I have called peaks across 10 different ChIP-seq studies. The same protein was assayed in each study. I am looking to create a 'master' peak list by intersecting/overlapping the ten studies to give me highly reproducible non-overlapping peaks. I also want to create peaks which are present in 1 .. N studies (for example N == 3), or peaks which are present in at least 1 .. N studies (for example N >= 3). To answer these questions I have tried two approaches:
The first uses bedtools multiinter to get intersecting regions (exact coordinate intersections):
bedtools multiinter -i *.narrowPeak | awk '$4 >= 3' | cut -f 1-3 | bedtools sort -i stdin | bedtools merge -i stdin -d 138
Here I can change the awk condition to get the regions I need. At the end of the command I am merging regions within 138 bp apart (average estimated fragment size across all 10 studies). However, there are two problems with this approach: First, this produces a lot of spurious peak regions. By that I mean, regions created by a small overlap between peaks from different studies. I could just remove these regions using some threshold, but what can I use to inform this decision and how can I justify one threshold over another? Second, if I look at the number of peaks created by (N == 9 and N == 10), these do not add up to the number produced by (N >= 9) as you might expect. This is because the intersecting regions generated are slightly different between the two operations.
The second uses bedops to get overlapping regions (Peak A is an element of Peak B):
bedops -u *.narrowPeak | bedmap --echo --echo-map-id-uniq - | awk -F"|" '(split($2, a, ";") >= 3)'
This creates a master list of peaks rather than intersecting regions. I can then merge these regions to produce unique peak regions. However, there is also two problems with this approach: First, a particular unique peak region will always be as long as the longest overlapping peak amongst the studies (because we're simply overlapping and merging, not intersecting and merging). This reduces the resolution of the peak-call. Second, the awk command doesn't guarantee that the peak region is in exactly N studies or >= N studies. Here are the number of peaks identified from different awk comparisons:
# Trying to get unique peak-regions in all 10 studies (== 10): 5129 regions > bedops -u *.narrowPeak | bedmap --echo --echo-map-id-uniq - | awk -F"|" '(split($2, a, ";") == 10)' | wc -l # This comparison should give the same number as above (>= 10), but doesn't: 6506 regions > bedops -u *.narrowPeak | bedmap --echo --echo-map-id-uniq - | awk -F"|" '(split($2, a, ";") >= 10)' | wc -l # There are even peaks produced for greater than 10 studies (> 10): 1376 regions bedops -u *.narrowPeak | bedmap --echo --echo-map-id-uniq - | awk -F"|" '(split($2, a, ";") > 10)' | wc -l
I didn't expect to get any peaks for (N > 10) because we only have 10 studies. However if you look at the output from this command you can see that multiple peaks from the same study which overlap a single peak in another study are counted (the study and peak number are listed after the "|" delimiter in the format "Study_PeakNumber").
chr1 7124410 7126690 King_30 1000 . 42.57261 296.17303 290.95993 1778|Buecker_38;Buecker_39;Chronis_40;Chronis_41;Hardison_4;Jacinto_2;Karwacki_50;Karwacki_51;King_30;Liu_22;Liu_23;Marson_68;Marson_69;Whyte_51;Whyte_52
Apologies if this post is confusing/long. I've found it quite hard to explain this problem clearly/succinctly. Any help greatly appreciated.