The only problem is that the next line is not offset from the last by one. You may want to re-check the expected output
I have a question related on how I could use a range given to split it in intervals of 100 length max. I'm looking for doing this in AWK, Shell Script, Perl or Python. Just to show a small example, I have a tab-delimited file that contains:
This is how it looks:
chr1 37850 38536 MACS_peak_1 7 chr1 820769 821857 MACS_peak_4 11 chr1 5018483 5019041 MACS_peak_9 6
And after taking each range, and split them in 100 intervals, the output should look like this. The last two columns are not important as much as obtain the ranges splitted, and I showed for this example to indicate the number of split and the length of the split, respectively:
chr1 37850 37950 1 100 chr1 37951 38051 2 100 chr1 38052 38152 3 100 chr1 38153 38253 4 100 chr1 38254 38354 5 100 chr1 38355 38455 6 100 chr1 38456 38536 7 80 chr1 820769 820869 1 100 chr1 820870 820970 2 100 chr1 820971 821071 3 100 chr1 821072 821172 4 100 chr1 821173 821273 5 100 chr1 821274 821374 6 100 chr1 821375 821475 7 100 chr1 821476 821576 8 100 chr1 821577 821677 9 100 chr1 821678 821778 10 100 chr1 821779 821857 11 78 chr1 5018483 5018583 1 100 chr1 5018584 5018684 2 100 chr1 5018685 5018785 3 100 chr1 5018786 5018886 4 100 chr1 5018887 5018987 5 100 chr1 5018988 5019041 6 53
In Python:
fh = open('input_file.txt')
out = open('output_file.txt','w')
def chunk_range(x1, x2, numb):
size = ((x2 - x1 + 1 ) / numb) + 1
while x1 <= x2:
next = x1 + size
yield x1, min(next - 1, x2)
x1 = next
for line in fh:
line = line.split()
chr = line[0]
x1 = int(line[1])
x2 = int(line[2])
numb = int(line[4])
n = 1
for st, end in chunk_range(x1,x2,numb):
out.write('%s\t%i\t%i\t%i\t%i\n' % (chr, st, end, n, end-st))
n+=1
fh.close()
out.close()
input_file.txt
chr1 37850 38536 MACS_peak_1 7
chr1 820769 821857 MACS_peak_4 11
chr1 5018483 5019041 MACS_peak_9 6
output_file.txt
chr1 37850 37948 1 98
chr1 37949 38047 2 98
chr1 38048 38146 3 98
chr1 38147 38245 4 98
chr1 38246 38344 5 98
chr1 38345 38443 6 98
chr1 38444 38536 7 92
chr1 820769 820868 1 99
chr1 820869 820968 2 99
chr1 820969 821068 3 99
chr1 821069 821168 4 99
chr1 821169 821268 5 99
chr1 821269 821368 6 99
chr1 821369 821468 7 99
chr1 821469 821568 8 99
chr1 821569 821668 9 99
chr1 821669 821768 10 99
chr1 821769 821857 11 88
chr1 5018483 5018576 1 93
chr1 5018577 5018670 2 93
chr1 5018671 5018764 3 93
chr1 5018765 5018858 4 93
chr1 5018859 5018952 5 93
chr1 5018953 5019041 6 88
Here's my Perl solution:
my $input = "
chr1 37850 38536 MACS_peak_1 7
chr1 820769 821857 MACS_peak_4 11
chr1 5018483 5019041 MACS_peak_9 6
";
foreach (split (/\n/, $input)){
next unless $_;
my ($chr, $start, $end, $id, $times) = split (/\s+/);
my $i = 1;
for ($i=1; $i<$times; $i++){
print $chr."\t".($start+($i-1)*101)."\t".($start+$i*101)."\t$i\t100\n";
}
print $chr."\t".($start+($i-1)*101)."\t$end\t$i\t".($end-$start-($i-1)*101)."\n";
}
The output should be exactly the same as in your question.
And here's a Perl code to make the intervals even:
my $input = "
chr1 37850 38536 MACS_peak_1 7
chr1 820769 821857 MACS_peak_4 11
chr1 5018483 5019041 MACS_peak_9 6
";
foreach (split (/\n/, $input)){
next unless $_;
my ($chr, $start, $end, $id, $times) = split (/\s+/);
my $range = int(($end-$start)/$times);
foreach (1..$times-1){
print $chr."\t".($start+$range*($_-1))."\t".($start+$range*$_-1)."\t$_\t".($range-1)."\n";
}
print $chr."\t".($start+$range*($times-1))."\t$end\t$times\t".($end-$start-$range*($times-1)-1)."\n";
}
My solution using R, the input file according to me is the bedGraphs file (or a coverage file obtained from the read file in bed format)
# creating a test data set
data=data.frame(chrom=rep("chr1",2),start=c(37850,820769),end=c(38536,821857),tmp=rep("macsPeak",2),cov=c(7,11))
# or reading the whole file in, remove the head in next line to get it working and put hash in previous
#data=read.csv(commandArgs(TRUE)[1],sep='\t',header=FALSE)
# dividing the coverage (last column of peak file) into divisors of 100 for the whole file
tmp=sapply(data[,3]-data[,2],function(x)c(rep(100,floor(x/100)),x-floor(x/100)*100))
# splitting the file
covList=lapply(1:length(data[,1]),
function(i){
data.frame(chrom=rep(data[i,1],data[i,5]),
start=data[i,2]+c(0,seq(100,(data[i,5]-1)*100,by=101)+1,tail(seq(100,(data[i,5]-1)*100,by=101)+1,1)+101),
end=data[i,2]+c(0,seq(100,(data[i,5]-1)*100,by=101)+1,(tail(seq(100,(data[i,5]-1)*100,by=101)+1,1)+101)-(data[i,5]-1))+tmp[[i]],
seq=seq(1,data[i,5]),reads=c(tmp[[i]][-length(tmp[[i]])],(tail(tmp[[i]],1)-data[i,5])+1))
}
)
# merging the data frames obtained in last step representing the each line of input file
final=do.call(rbind,lapply(1:length(covList),function(i)if(length(covList[[i]])>1)cbind(covList[[i]])))
# writing the file out
write.table(final,'bedGraph_expanded.tsv',sep='\t',row.names=F,quote=F)
Output
chrom start end seq reads
1 chr1 37850 37950 1 100
2 chr1 37951 38051 2 100
3 chr1 38052 38152 3 100
4 chr1 38153 38253 4 100
5 chr1 38254 38354 5 100
6 chr1 38355 38455 6 100
7 chr1 38456 38536 7 80
8 chr1 820769 820869 1 100
9 chr1 820870 820970 2 100
10 chr1 820971 821071 3 100
11 chr1 821072 821172 4 100
12 chr1 821173 821273 5 100
13 chr1 821274 821374 6 100
14 chr1 821375 821475 7 100
15 chr1 821476 821576 8 100
16 chr1 821577 821677 9 100
17 chr1 821678 821778 10 100
18 chr1 821779 821857 11 78
You can save the whole script and use as
Rscript bedGraph2bed.R inputFile.bed
Cheers
bedtools has a utility called makewindows that will do exactly this. It seems that the 5th column in your file is superfluous, as you are really just making 100bp windows from the original intervals. One note in your example is that your start coordinates are one higher than they should be as BED start coordinates are 0-based, whereas the end coordinates are 1-based. I hope this helps.
bedtools makewindows -b data.bed -w 100 -i winnum | awk -v OFS="\t" '{print $0,$3-$2}'
chr1 37850 37950 1 100
chr1 37950 38050 2 100
chr1 38050 38150 3 100
chr1 38150 38250 4 100
chr1 38250 38350 5 100
chr1 38350 38450 6 100
chr1 38450 38536 7 86
chr1 820769 820869 1 100
chr1 820869 820969 2 100
chr1 820969 821069 3 100
chr1 821069 821169 4 100
chr1 821169 821269 5 100
chr1 821269 821369 6 100
chr1 821369 821469 7 100
chr1 821469 821569 8 100
chr1 821569 821669 9 100
chr1 821669 821769 10 100
chr1 821769 821857 11 88
chr1 5018483 5018583 1 100
chr1 5018583 5018683 2 100
chr1 5018683 5018783 3 100
chr1 5018783 5018883 4 100
chr1 5018883 5018983 5 100
chr1 5018983 5019041 6 58
if you really meant to skip one base between each interval, you'll need to change the awk statement. But I think this is what you want. As a control, load your file into UCSC and see what the windows look like.
The only problem is that the next line is not offset from the last by one. You may want to re-check the expected output
My hunch is that the OP didn't intend to create BED windows that skip a base between each window. If intentional, a very minor change to the awk statement will solve this. Updated to demonstrate.
It's sounds like you're probably correct - you see, I don't have any experience with BED files or bedtools. However, I don't think the minor change that you've made to your awk command would work as intended. Did you test the code?
I have followed this from here:
http://stackoverflow.com/questions/13551174/how-to-make-intervals-in-a-range-of-numbers
Here's one way using awk:
awk -v num=100 '{ for (i=1;i<=$NF;i++) { print $1, $2, var=(i==$NF?$3:$2+=num), i, (var-$2==0?"100":var-$2); $2++ } }' file
Results:
chr1 37850 37950 1 100
chr1 37951 38051 2 100
chr1 38052 38152 3 100
chr1 38153 38253 4 100
chr1 38254 38354 5 100
chr1 38355 38455 6 100
chr1 38456 38536 7 80
chr1 820769 820869 1 100
chr1 820870 820970 2 100
chr1 820971 821071 3 100
chr1 821072 821172 4 100
chr1 821173 821273 5 100
chr1 821274 821374 6 100
chr1 821375 821475 7 100
chr1 821476 821576 8 100
chr1 821577 821677 9 100
chr1 821678 821778 10 100
chr1 821779 821857 11 78
chr1 5018483 5018583 1 100
chr1 5018584 5018684 2 100
chr1 5018685 5018785 3 100
chr1 5018786 5018886 4 100
chr1 5018887 5018987 5 100
chr1 5018988 5019041 6 53
seems that we have a whitespace bug that we'll be fixing shortly
I found that tab delimited appears to not be recognized so I had to put the example in 4 space-delimited format
What is it you are trying to do? Summarize/bin data on genomic positions in windows/intervals? Then I suggest you use bedtools makewindows and then bedtools map to map you data on the genomic intervals