Count Columns For Each Window
4
0
Entering edit mode
8.2 years ago
roll ▴ 330

I would like to count columns for the matching windows (that is in bed format). I tried awk but too slow.

For example, I have the following two windows

chr1:100-1000
chr1:1500-3000


Fort these 2 windows I found the following matches and would like to count 6th column based on whether the last column is 1 or 0.

chr1 100 1000 chr1 200 0 1 0
chr1 100 1000 chr1 500 0 4 0
chr1 100 1000 chr1 700 0 6 1
chr1 1500 3000 chr1 2000 0 9 1
chr1 1500 3000 chr1 2000 0 1 0


My desired results will be

chr1 100 1000 6/11
chr1 1500 3000 9/10


I tried doing this with awk but since i have millions of entry it is very slow. Is there a bioinformatics tool that do such counts? Or any faster solution?

awk • 2.6k views
0
Entering edit mode

So let me try to understand this from what I see in your example. Let's say for window of chr 100 1000, you sum up all the numbers in the seventh column that corresponds to this window (hence giving the final value of 11), and then for all those rows, you are looking for the rows that has value of 1 in the final column. Is that right?

0
Entering edit mode

Also, I assume this is tab-delimited file?

0
Entering edit mode

yes exactly. My attempt was to have a list of all the windows and use while loop in perl. And awk to count the counts. But it is very slow.

0
Entering edit mode

Can you have more than one rows with last column value of 1? If so, do you add those values together (so, if the first line in your example had last column value of 1, would this be 7/11?

0
Entering edit mode

yes you can have more than 1. The example you are giving is correct.

3
Entering edit mode
8.2 years ago

Use awk to split your BED data into two files:

1. The first file contains all 0/1 overlaps (where the eighth column is either 0 or 1), with the seventh column moved to the fifth (score) column
2. The second file contains 1-only overlaps (where the eighth column is 1), with the seventh column moved to the fifth (score) column

Let's call these two files allOverlaps.bed and oneOverlaps.bed. Make sure they are sorted per BEDOPS sort-bed:

$awk '{print$1"\t"$2"\t"$3"\t"$4"\t"$7}' overlaps.bed | sort-bed - > allOverlaps.bed
$awk '{if ($8==1) {print $1"\t"$2"\t"$3"\t"$4"\t"$7}}' overlaps.bed | sort-bed - > oneOverlaps.bed  Set up a third, tab-delimited, sorted BED file called windows.bed, with your windows of interest: chr1 100 1000 chr1 1500 3000 ...  Here's a one-liner where you apply two BEDOPS bedmap operations in piped, serial fashion: $ bedmap --echo --sum --delim '\t' windows.bed oneOverlaps.bed | bedmap --echo --sum --delim '/' - allOverlaps.bed > answer.bed


The file answer.bed should follow your specified output format, I think.

To understand how this works, you are applying the --sum operand on the score column of any mapped overlaps (as a result, note that this allows multiple rows with 1 in the last column). By doing this map operation on each category of overlaps, and then by piping results from the first set to the next, we get the final answer.

Should be speedy, I'd think, though awk is tough to beat in straight-up parsing. The sort-bed tool is much faster than alternatives and the bedmap application works very efficiently with BED files (and since your data are already in that format, more or less, why not use the tools to match). We've got some performance enhancements coming soon to make that even faster.

2
Entering edit mode
8.2 years ago
#!/usr/bin/env perl
use strict;
use warnings;

my %store = ();

while (<>) {
chomp;
next if /^\s*$/; my @l = split; my$coord = $l[0].':'.$l[1].':'.$l[2]; ## your chromosome names better don't contain ':'$store{$coord} = [0,0] unless (exists$store{$coord});$store{$coord}->[0] +=$l[6];
$store{$coord}->[1] += $l[6] if$l[7] == 1;
}

while (my ($k,$v) = each %store) {
print join "\t", ( (split (':', $k)),$v->[1].'/'.$v->[0]), "\n"; } __END__  Fast enough? ADD COMMENT 0 Entering edit mode That sure was fast...I was thinking of a same approach, except I am using python. ADD REPLY 0 Entering edit mode Fine, post yours, and op can do the speed test ;) ADD REPLY 0 Entering edit mode Hmm...16 lines of Perl that only uses while loop vs. 78 lines of python codes containing for loops...I think there already is a winner before the speed test has even begun. You sir, has already put my coding skill to shame :) ADD REPLY 1 Entering edit mode 8.2 years ago Not sure if I completely understand the question, but I'll give it a shot... If the starting point is a bed file of target regions already intersected with the "score" regions, like this: cat b.bed chr1 100 1000 chr1 200 0 1 0 chr1 100 1000 chr1 500 0 4 0 chr1 100 1000 chr1 700 0 6 1 chr1 1500 3000 chr1 2000 0 9 1 chr1 1500 3000 chr1 2000 0 1 0  Then with awk & bedtools you can do: awk '{print$0 "\t" $7 *$8}' b.bed \
| groupBy -g 1,2,3 -c 7,9 -ops sum,sum \
| awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$5"/"\$4}'


(The second awk is just to have the output formatted exactly as in the OP's example).

If instead the starting point is a bed file of target regions (w.bed) and a bed file of score intervals (a.bed), create the b.bed file above with:

## Example input files
cat w.bed
chr1 100 1000
chr1 1500 3000

cat a.bed
chr1 199 200 0 1 0
chr1 499 500 0 4 0
chr1 699 700 0 6 1
chr1 1999 2000 0 9 1
chr1 1999 2000 0 1 0

## Create b.bed above:
intersectBed -b a.bed -a w.bed -wa -wb | cut -f1-4,6- > b.bed

0
Entering edit mode
8.2 years ago
youngcsong ▴ 100

Well, nonetheless, here's my code...big apologies for stealing someone's thread, but a critic here would help tons...

from optparse import OptionParser
from os import listdir
import os, getpass, sys, string, re

if __name__ == '__main__':
parser = OptionParser(usage='%prog ...')
parser.add_option("-t", "--table_file", help="Variation table extracted from VCF file (i.e. 4th and 5th columns of VCF file)")

(options, args) = parser.parse_args()

tableFile = options.table_file

bedHandle = open(tableFile,"rb")

# This is where our results go in
compareDict = {}

totalDenominator = 0
totalNumerator = 0

# Let's process the first line
bedFirstLine = string.strip(bedFirstLine)

bedFirstTab = bedFirstLine.split("\t")

# This value gets added to denominator and also to the numerator if the switch is equal to one
firstValue = bedFirstTab[6]
totalDenominator = firstValue

firstSwitch = bedFirstTab[7]

# If the switch is equal to one, we add the corresponding value to the numerator
if int(firstSwitch) == 1:
totalNumerator += firstValue

# Our key to the dictionary consists of the chromosome number, first coordinate and second coordinate of the window, all concatenated by "_"
compareKey = bedFirstTab[0] + " \t" + bedFirstTab[1] + " \t" + bedFirstTab[2]

compareDict[compareKey] = (totalNumerator, totalDenominator)

# We now process the subsequent lines by using the methods similar to the one that we applied to the first line
bedNextLines = string.strip(bedNextLines)
bedNextTab = bedNextLines.split("\t")
keyCompare = bedNextTab[0] + " \t" + bedNextTab[1] + " \t" + bedNextTab[2]
tempValue = bedNextTab[6]
tempSwitch = bedNextTab[7]

if compareDict.has_key(keyCompare):
currEntryList = compareDict[keyCompare]
currNumerator = int(currEntryList[0])
currDenominator = int(currEntryList[1])

currDenominator += int(tempValue)

if int(tempSwitch) == 1:
currNumerator += int(tempValue)

compareDict[keyCompare] = (currNumerator, currDenominator)
else:
keyCompare = bedNextTab[0] + " \t" + bedNextTab[1] + " \t" + bedNextTab[2]
nextValue =  bedNextTab[6]
totalDenominator = int(nextValue)

nextSwitch = int(bedNextTab[7])

if int(nextSwitch) == 1:
totalNumerator += int(nextValue)

compareDict[keyCompare] = (totalNumerator, totalDenominator)

for eachKey in compareDict.keys():
print eachKey,"\t",compareDict[eachKey][0],"/",compareDict[eachKey][1]

1
Entering edit mode

The main redundancy seems to come from treating the first line separately duplicating a lot of code. I don't see why this would be necessary. The rest is almost 1:1 equivalent to the perl code (+ a few comments and some assignments options and usage message, in your favor).

0
Entering edit mode

yup...the redundancy is certainly the point of concern in this code, though I am contemplating how to deal with this. One way would be to take the part that is redundant and make it as separate function, and then replace all the chunks of redundant code with function()...but even then that becomes little tedious (or I am just being lazy, given that it's Friday afternoon at 5 here in Pacific Northwest).

0
Entering edit mode

I think you can simpy treat each line equally. If the dict doesn't contain the key, just add (0,0) for that key: look at these blocks:

This is redundant and can be removed:

totalDenominator = 0
totalNumerator = 0

# Let's process the first line
[...snip...]
compareDict[compareKey] = (totalNumerator, totalDenominator)


This can be abbreviated:

   if compareDict.has_key(keyCompare):
[....]
else:
keyCompare = bedNextTab[0] + " \t" + bedNextTab[1] + " \t" + bedNextTab[2] # duplicate construction of key!
[....]
compareDict[keyCompare] = (totalNumerator, totalDenominator)


And can be replaced by:

  keyCompare = bedNextTab[0] + " \t" + bedNextTab[1] + " \t" + bedNextTab[2]
if not compareDict.has_key(keyCompare):
compareDict[keyCompare] = (0,0)
currEntryList = compareDict[keyCompare]
currNumerator = int(currEntryList[0])
currDenominator = int(currEntryList[1])
currDenominator += int(tempValue)
if int(tempSwitch) == 1:
currNumerator += int(tempValue)
compareDict[keyCompare] = (currNumerator, currDenominator)