Question: compare two groups of multiple bed files
0
gravatar for blur
16 days ago by
blur140
European Union
blur140 wrote:

Hi,

I have 2 experiments, each with 10 peak files in bedtools format. My idea was to check which peaks are repeated in 6/10 experiments in group 1 - I'll call the results group1_repeated.bed, Then 6/10 in group 2 to created group2_repeated.bed The idea is then to compare group1_repeated.bed to group2_repeated.bed.

I tried bedtools multiinter to get the group1_repeated, but this breaks down the peaks into sub-peaks and, to be honest, I'd prefer the end result to be the maximum peaks of the 10. I also tried putting all 10 files together and doing bedtools merge, but then I can't count the number of repeats...

I checked the forum and saw BEDOPS, but I am not sure if that's the right tool for me, as I saw someone say that if you have one big peak that overlaps several shorter ones, the count will be all off...

Any ideas?

Thanks in advance,

bedfiles bedtools • 76 views
ADD COMMENTlink modified 16 days ago by Alex Reynolds31k • written 16 days ago by blur140
0
gravatar for Alex Reynolds
16 days ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

You could union the files, specify the overlap criterion for mapping, and then post-process the result to count true peaks. The ID field (fifth column) can be useful as an identifier. Using intervals alone is probably not enough, because of cases where peaks overlap.

For example, to union sorted BED files:

$ bedops -u exp1peaks1.bed exp1peaks2.bed ... exp1peaks10.bed > exp1peaks.bed

Then map the unioned set of peaks against each other:

$ bedmap --count --echo --echo-map-id-uniq --fraction-map 0.51 exp1peaks.bed | awk '($1>=6)' | cut -f2- > candidates.bed
  • The --count operator returns the number of mapped peaks that overlap the reference peak.
  • The --echo operator returns the reference peak.
  • The --echo-map-id-uniq operator returns the unique IDs of mapped peaks that overlap the reference peak.
  • The --fraction-map 0.51 operator is the overlap criterion, which requires that more than half of a mapped peak must overlap the reference peak to be called an overlap.

The awk and cut statements at the end return candidate peaks, where there are six or more mapped peaks that met the bedmap overlap criterion. The output is another sorted BED file.

To deal with this:

I saw someone say that if you have one big peak that overlaps several shorter ones, the count will be all off.

Once you have candidate peaks, you can use a Python script to look at the contents of the mapped peak IDs returned by --echo-map-id-uniq.

Using peak IDs that follow a parseable pattern will help you count true overlaps among common peaks in candidates.bed, filtering out those where you run into the above scenario.

The procedure could be redone on files for the second experiment. Then you can compare intervals across two experiments and do a final count.

ADD COMMENTlink written 16 days ago by Alex Reynolds31k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1197 users visited in the last hour