get intervals of a fraction of bed-files
2
0
Entering edit mode
6.9 years ago
markus.glass ▴ 40

Hello,

I have a set of 8 .bed files and I want to extract those (sub-)intervals that are present in a certain fraction of these files, e.g., intervals that occur in 6 of these 8 files. Does anyone know an efficient way/tool to do this?

Markus

ChIP-Seq • 1.9k views
ADD COMMENT
1
Entering edit mode
6.9 years ago

Here is an efficient solution with BEDOPS, which avoids approaches with other tools that require M^N pairwise comparisons, for M elements and N sets.

Say you have N files (your specific example uses 8 files, but this shows a generic solution for N files). You want all elements from N files, which overlap at least F (6) elements from each of all N sets.

Label each BED file to note where it is coming from:

$ for inFn in `find . -name "*.bed"`; do echo $inFn; outFn=${inFn%%.*}.labeled.bed; prefix=${inFn%%.*}; echo ${prefix} | cut -f1-3 ${inFn} - | sort-bed - > $outFn; done

Set shell variables with the number of sets and your overlap threshold, e.g.:

$ export N=8
$ export F=6

Then do the set operations on the labeled files with BEDOPS bedops and bedmap:

$ bedops --everything file1.labeled.bed file2.labeled.bed ... fileN.labeled.bed \
    | bedmap --echo --echo-map-id-uniq --fraction-map 1 --delim '\t' - \
    | cut -f1-3,5 - \
    | awk -vF=${F} '{ n = split($4, a, ";"); if (n >= F) { print $0; } }' \
    > intervals_overlapping_at_least_F_inputs.bed

The file intervals_overlapping_at_least_F_inputs.bed contains a list of intervals that are contained entirely within at least F sets, with an ID value indicating its origin set.

ADD COMMENT
0
Entering edit mode

Hello Pierre, your idea sounds good, I will try it. And Alex: bedops seem to be an interesting toolbox, I will have a closer look at it. Thank you!

ADD REPLY
0
Entering edit mode
6.9 years ago
  • for each bed : convert your bed to uniq-base-bed and sort | uniq
  • uniq -ccollect the base positions occuring more than 6 times
  • convert back to bed
  • merge with bedtools
 for F in *.bed ; do awk '{B=int($2);E=int($3);for(i=B;i<E;++i) {printf("%s\t%d\t%d\n",$1,i,i+1);}}' $F | sort | uniq ; done | sort | uniq -c | awk '{if($1>=6) printf("%s\t%s\t%s\n",$2,$3,$4);}' | bedtools merge
ADD COMMENT

Login before adding your answer.

Traffic: 2090 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6