Bedtools: Top N Most Similar Regions When Comparing Two Bed/Wig/Bam Files?
Entering edit mode
10.5 years ago
Ahdf-Lell-Kocks ★ 1.6k

Is there an easy way of finding, probably with bedtools, given a window size, the top N most correlated regions when comparing two bed/wig files? For example, in comparing two bed/wig/bam files that have PolII data for 2 conditions, to give the top N windows where the wiggle profiles are most similar?

bedtools correlation • 4.3k views
Entering edit mode

depends what you mean by "easy" you could do a binomial test for each window. it'd be simpler to answer your question if you have a few lines of example data.

Entering edit mode
10.5 years ago

If I understand the problem correctly, here's a basic workflow that may work for you. One basic assumption I am making here is that by PolII "data", you mean alignments from a ChIP-seq experiment.

1. Convert your WIG files to BEDGRAPH with the venerable Jim Kent's wigToBigWig and bigWigToBedGraph.

  wigToBigWig polII.condA.wig chrom.sizes
  wigToBigWig polII.condB.wig chrom.sizes

  bigWigToBedGraph chrom.sizes polII.condA.bedg
  bigWigToBedGraph chrom.sizes polII.condB.bedg

2. Make a BED file of non-overlapping 100kb windows. I chose 100Kb. These will end up being the windows from which you are assessing the correlation b/w the two conditions.

  bedtools makewindows -g chrom.sizes -w 100000 > windows.bed

3. Measure the coverage in the BEDGRAPH files at each 100Kb window. This step uses a new (unreleased) tool called "map" (think functional programming), that summarizes data for intervals in one file (here, the "score/depth" column from your BEDGRAPHs) based on overlaps with another file (i.e., windows.bed). In other words, we are summing the score column for each BEDGRAPH interval that overlaps a window in windows.bed. Column 4 (-c 4) is the score/depth column, and "-o sum" is the operation that should be applied to that column. NOTE: both the BEDGRAPH and windows.bed files must be sorted by chrom, then start. If this workflow seems roughly like what you want, you can grab the "map" function from the bedtools GitHub repository.

  bedtools map -a windows.bed -b polII.condA.bedg -c 4 -o sum \
  bedtools map -a windows.bed -b polII.condB.bedg -c 4 -o sum \

4. Combine the two files tow create a single file with the intervals and the values from each condition.

 paste <(cut -f 1-4 \
       <(cut -f 4 \

 # the output will look something like this, where the first three columns 
 # are the windows, and the last two columns are the scores 
 # from conditions A and B, respectively. E.g.:
 chr1    0        100000    163    172
 chr1    100000   200000    313    94
 chr1    200000   300000    1      774

5. Use the counts above (columns 4 and 5) to compute a measurement of correlation. I'm not sure of what the right measure is here, so I will name this piece your script, but in principle, assuming you have some score as the 6th column, you could just use sort and head to pick the top N values. Perhaps a more poweful alternative would be to load this file into R and use it to assess correlation, etc.

 yourscript.[pl|.py|.R|.rb|.*] | \
    sort -k6,6nr | \
    head -n $N | \
    > table1.naturepaper.txt
Entering edit mode

brilliant! thanks!


Login before adding your answer.

Traffic: 938 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6