Question

compare two groups of multiple bed files

0

Entering edit mode

3.4 years ago

blur ▴ 280

Hi,

I have 2 experiments, each with 10 peak files in bedtools format. My idea was to check which peaks are repeated in 6/10 experiments in group 1 - I'll call the results group1_repeated.bed, Then 6/10 in group 2 to created group2_repeated.bed. The idea is then to compare group1_repeated.bed to group2_repeated.bed.

I tried bedtools multiinter to get the group1_repeated, but this breaks down the peaks into sub-peaks and, to be honest, I'd prefer the end result to be the maximum peaks of the 10. I also tried putting all 10 files together and doing bedtools merge, but then I can't count the number of repeats.

I checked the forum and saw BEDOPS, but I am not sure if that's the right tool for me, as I saw someone say that if you have one big peak that overlaps several shorter ones, the count will be all off.

Any ideas?

Thanks in advance

bed bedtools • 819 views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 3.4 years ago by blur ▴ 280

score 0 · Answer 1 · 2020-11-18

You could union the files, specify the overlap criterion for mapping, and then post-process the result to count true peaks. The ID field (fifth column) can be useful as an identifier. Using intervals alone is probably not enough, because of cases where peaks overlap.

For example, to union sorted BED files:

$ bedops -u exp1peaks1.bed exp1peaks2.bed ... exp1peaks10.bed > exp1peaks.bed

Then map the unioned set of peaks against each other:

$ bedmap --count --echo --echo-map-id-uniq --fraction-map 0.51 exp1peaks.bed | awk '($1>=6)' | cut -f2- > candidates.bed

The --count operator returns the number of mapped peaks that overlap the reference peak.
The --echo operator returns the reference peak.
The --echo-map-id-uniq operator returns the unique IDs of mapped peaks that overlap the reference peak.
The --fraction-map 0.51 operator is the overlap criterion, which requires that more than half of a mapped peak must overlap the reference peak to be called an overlap.

The awk and cut statements at the end return candidate peaks, where there are six or more mapped peaks that met the bedmap overlap criterion. The output is another sorted BED file.

To deal with this:

I saw someone say that if you have one big peak that overlaps several shorter ones, the count will be all off.

Once you have candidate peaks, you can use a Python script to look at the contents of the mapped peak IDs returned by --echo-map-id-uniq.

Using peak IDs that follow a parseable pattern will help you count true overlaps among common peaks in candidates.bed, filtering out those where you run into the above scenario.

The procedure could be redone on files for the second experiment. Then you can compare intervals across two experiments and do a final count.