Is there an easy way of finding, probably with bedtools, given a window size, the top N most correlated regions when comparing two bed/wig files? For example, in comparing two bed/wig/bam files that have PolII data for 2 conditions, to give the top N windows where the wiggle profiles are most similar?
If I understand the problem correctly, here's a basic workflow that may work for you. One basic assumption I am making here is that by PolII "data", you mean alignments from a ChIP-seq experiment.
wigToBigWig polII.condA.wig chrom.sizes polII.condA.bw wigToBigWig polII.condB.wig chrom.sizes polII.condB.bw bigWigToBedGraph polII.condA.bw chrom.sizes polII.condA.bedg bigWigToBedGraph polII.condB.bw chrom.sizes polII.condB.bedg
2. Make a BED file of non-overlapping 100kb windows. I chose 100Kb. These will end up being the windows from which you are assessing the correlation b/w the two conditions.
bedtools makewindows -g chrom.sizes -w 100000 > windows.bed
3. Measure the coverage in the BEDGRAPH files at each 100Kb window. This step uses a new (unreleased) tool called "map" (think functional programming), that summarizes data for intervals in one file (here, the "score/depth" column from your BEDGRAPHs) based on overlaps with another file (i.e., windows.bed). In other words, we are summing the score column for each BEDGRAPH interval that overlaps a window in windows.bed. Column 4 (-c 4) is the score/depth column, and "-o sum" is the operation that should be applied to that column. NOTE: both the BEDGRAPH and windows.bed files must be sorted by chrom, then start. If this workflow seems roughly like what you want, you can grab the "map" function from the bedtools GitHub repository.
bedtools map -a windows.bed -b polII.condA.bedg -c 4 -o sum \ > polII.condA.windows.bedg bedtools map -a windows.bed -b polII.condB.bedg -c 4 -o sum \ > polII.condB.windows.bedg
4. Combine the two files tow create a single file with the intervals and the values from each condition.
paste <(cut -f 1-4 polII.condA.windows.bedg) \ <(cut -f 4 polII.condB.windows.bedg) \ > polII.conditions.windows.txt # the output will look something like this, where the first three columns # are the windows, and the last two columns are the scores # from conditions A and B, respectively. E.g.: chr1 0 100000 163 172 chr1 100000 200000 313 94 chr1 200000 300000 1 774 ...
5. Use the counts above (columns 4 and 5) to compute a measurement of correlation. I'm not sure of what the right measure is here, so I will name this piece your script, but in principle, assuming you have some score as the 6th column, you could just use
head to pick the top N values. Perhaps a more poweful alternative would be to load this file into R and use it to assess correlation, etc.
N=10 yourscript.[pl|.py|.R|.rb|.*] polII.conditions.windows.txt | \ sort -k6,6nr | \ head -n $N | \ > table1.naturepaper.txt