Question: get intervals of a fraction of bed-files
0
gravatar for markus.glass
3.5 years ago by
markus.glass20
markus.glass20 wrote:

Hello,

I have a set of 8 .bed files and I want to extract those (sub-)intervals that are present in a certain fraction of these files, e.g., intervals that occur in 6 of these 8 files. Does anyone know an efficient way/tool to do this?

Markus

chip-seq • 1.1k views
ADD COMMENTlink modified 3.5 years ago by Alex Reynolds31k • written 3.5 years ago by markus.glass20
1
gravatar for Alex Reynolds
3.5 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

Here is an efficient solution with BEDOPS, which avoids approaches with other tools that require M^N pairwise comparisons, for M elements and N sets.

Say you have N files (your specific example uses 8 files, but this shows a generic solution for N files). You want all elements from N files, which overlap at least F (6) elements from each of all N sets.

Label each BED file to note where it is coming from:

$ for inFn in `find . -name "*.bed"`; do echo $inFn; outFn=${inFn%%.*}.labeled.bed; prefix=${inFn%%.*}; echo ${prefix} | cut -f1-3 ${inFn} - | sort-bed - > $outFn; done

Set shell variables with the number of sets and your overlap threshold, e.g.:

$ export N=8
$ export F=6

Then do the set operations on the labeled files with BEDOPS bedops and bedmap:

$ bedops --everything file1.labeled.bed file2.labeled.bed ... fileN.labeled.bed \
    | bedmap --echo --echo-map-id-uniq --fraction-map 1 --delim '\t' - \
    | cut -f1-3,5 - \
    | awk -vF=${F} '{ n = split($4, a, ";"); if (n >= F) { print $0; } }' \
    > intervals_overlapping_at_least_F_inputs.bed

The file intervals_overlapping_at_least_F_inputs.bed contains a list of intervals that are contained entirely within at least F sets, with an ID value indicating its origin set.

ADD COMMENTlink written 3.5 years ago by Alex Reynolds31k

Hello Pierre, your idea sounds good, I will try it. And Alex: bedops seem to be an interesting toolbox, I will have a closer look at it. Thank you!

ADD REPLYlink written 3.5 years ago by markus.glass20
0
gravatar for Pierre Lindenbaum
3.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:
  • for each bed : convert your bed to uniq-base-bed and sort | uniq
  • uniq -ccollect the base positions occuring more than 6 times
  • convert back to bed
  • merge with bedtools
 for F in *.bed ; do awk '{B=int($2);E=int($3);for(i=B;i<E;++i) {printf("%s\t%d\t%d\n",$1,i,i+1);}}' $F | sort | uniq ; done | sort | uniq -c | awk '{if($1>=6) printf("%s\t%s\t%s\n",$2,$3,$4);}' | bedtools merge
ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Pierre Lindenbaum131k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1846 users visited in the last hour