Question: How To Make Intervals In A Range Of Numbers?
3
gravatar for daniel.soronellas
16 months ago by
daniel.soronellas220 wrote:

I have a question related on how I could use a range given to split it in intervals of 100 length max. I'm looking for doing this in AWK, Shell Script, Perl or Python. Just to show a small example, I have a tab-delimited file that contains:

  • Chromosome Number (1 to 23)
  • Start range
  • End range
  • ID of the range
  • Number indicating how many times the range have to be splitted

This is how it looks:

chr1    37850      38536      MACS_peak_1    7
chr1    820769     821857     MACS_peak_4    11
chr1    5018483    5019041    MACS_peak_9    6

And after taking each range, and split them in 100 intervals, the output should look like this. The last two columns are not important as much as obtain the ranges splitted, and I showed for this example to indicate the number of split and the length of the split, respectively:

chr1    37850    37950    1    100
chr1    37951    38051    2    100
chr1    38052    38152    3    100
chr1    38153    38253    4    100
chr1    38254    38354    5    100
chr1    38355    38455    6    100
chr1    38456    38536    7    80
chr1    820769    820869    1    100
chr1    820870    820970        2    100
chr1    820971    821071        3    100
chr1    821072    821172        4    100
chr1    821173    821273        5    100
chr1    821274    821374        6    100
chr1    821375    821475        7    100
chr1    821476    821576        8    100
chr1    821577    821677        9    100
chr1    821678    821778        10    100
chr1    821779    821857        11    78
chr1    5018483    5018583    1    100
chr1    5018584    5018684    2    100
chr1    5018685    5018785    3    100
chr1    5018786    5018886    4    100
chr1    5018887    5018987    5    100
chr1    5018988    5019041    6    53
ADD COMMENTlink modified 16 months ago by Aaronquinlan7.3k • written 16 months ago by daniel.soronellas220
1

What is it you are trying to do? Summarize/bin data on genomic positions in windows/intervals? Then I suggest you use bedtools makewindows and then bedtools map to map you data on the genomic intervals

ADD REPLYlink written 16 months ago by Irsan3.4k

seems that we have a whitespace bug that we'll be fixing shortly

ADD REPLYlink modified 16 months ago • written 16 months ago by Istvan Albert ♦♦ 38k

I found that tab delimited appears to not be recognized so I had to put the example in 4 space-delimited format

ADD REPLYlink written 16 months ago by daniel.soronellas220
7
gravatar for a.zielezinski
16 months ago by
a.zielezinski3.0k
a.zielezinski3.0k wrote:

In Python:

fh = open('input_file.txt')
out = open('output_file.txt','w')

def chunk_range(x1, x2, numb):
    size = ((x2 - x1 + 1 ) / numb) + 1
    while x1 <= x2:
        next = x1 + size
        yield x1, min(next - 1, x2)
        x1 = next

for line in fh:
    line = line.split()
    chr = line[0]
    x1 = int(line[1])
    x2 = int(line[2])
    numb = int(line[4])
    n = 1
    for st, end in chunk_range(x1,x2,numb):
        out.write('%s\t%i\t%i\t%i\t%i\n' % (chr, st, end, n, end-st))
        n+=1

fh.close()
out.close()

input_file.txt

chr1    37850      38536      MACS_peak_1    7
chr1    820769     821857     MACS_peak_4    11
chr1    5018483    5019041    MACS_peak_9    6

output_file.txt

chr1    37850       37948       1    98
chr1    37949       38047       2    98
chr1    38048       38146       3    98
chr1    38147       38245       4    98
chr1    38246       38344       5    98
chr1    38345       38443       6    98
chr1    38444       38536       7    92
chr1    820769      820868     1    99
chr1    820869      820968     2    99
chr1    820969      821068     3    99
chr1    821069      821168     4    99
chr1    821169      821268     5    99
chr1    821269      821368     6    99
chr1    821369      821468     7    99
chr1    821469      821568     8    99
chr1    821569      821668     9    99
chr1    821669      821768    10    99
chr1    821769      821857    11    88
chr1    5018483    5018576    1    93
chr1    5018577    5018670    2    93
chr1    5018671    5018764    3    93
chr1    5018765    5018858    4    93
chr1    5018859    5018952    5    93
chr1    5018953    5019041    6    88
ADD COMMENTlink modified 16 months ago • written 16 months ago by a.zielezinski3.0k
5
gravatar for Sukhdeep Singh
16 months ago by
Sukhdeep Singh4.6k
Germany
Sukhdeep Singh4.6k wrote:

My solution using R, the input file according to me is the bedGraphs file (or a coverage file obtained from the read file in bed format)

# creating a test data set
data=data.frame(chrom=rep("chr1",2),start=c(37850,820769),end=c(38536,821857),tmp=rep("macsPeak",2),cov=c(7,11))

# or reading the whole file in, remove the head in next line to get it working and put hash in previous
#data=read.csv(commandArgs(TRUE)[1],sep='\t',header=FALSE)

# dividing the coverage (last column of peak file) into divisors of 100 for the whole file
tmp=sapply(data[,3]-data[,2],function(x)c(rep(100,floor(x/100)),x-floor(x/100)*100))

# splitting the file
covList=lapply(1:length(data[,1]),
   function(i){
     data.frame(chrom=rep(data[i,1],data[i,5]),
              start=data[i,2]+c(0,seq(100,(data[i,5]-1)*100,by=101)+1,tail(seq(100,(data[i,5]-1)*100,by=101)+1,1)+101),
              end=data[i,2]+c(0,seq(100,(data[i,5]-1)*100,by=101)+1,(tail(seq(100,(data[i,5]-1)*100,by=101)+1,1)+101)-(data[i,5]-1))+tmp[[i]],
              seq=seq(1,data[i,5]),reads=c(tmp[[i]][-length(tmp[[i]])],(tail(tmp[[i]],1)-data[i,5])+1))
     }
   )

# merging the data frames obtained in last step representing the each line of input file
final=do.call(rbind,lapply(1:length(covList),function(i)if(length(covList[[i]])>1)cbind(covList[[i]])))

# writing the file out
write.table(final,'bedGraph_expanded.tsv',sep='\t',row.names=F,quote=F)

Output

chrom  start    end seq reads
1   chr1  37850  37950   1   100
2   chr1  37951  38051   2   100
3   chr1  38052  38152   3   100
4   chr1  38153  38253   4   100
5   chr1  38254  38354   5   100
6   chr1  38355  38455   6   100
7   chr1  38456  38536   7    80
8   chr1 820769 820869   1   100
9   chr1 820870 820970   2   100
10  chr1 820971 821071   3   100
11  chr1 821072 821172   4   100
12  chr1 821173 821273   5   100
13  chr1 821274 821374   6   100
14  chr1 821375 821475   7   100
15  chr1 821476 821576   8   100
16  chr1 821577 821677   9   100
17  chr1 821678 821778  10   100
18  chr1 821779 821857  11    78

You can save the whole script and use as Rscript bedGraph2bed.R inputFile.bed

Cheers

ADD COMMENTlink modified 16 months ago • written 16 months ago by Sukhdeep Singh4.6k
4
gravatar for qiyunzhu
16 months ago by
qiyunzhu360
Buffalo
qiyunzhu360 wrote:

Here's my Perl solution:

my $input = "
chr1    37850      38536      MACS_peak_1    7
chr1    820769     821857     MACS_peak_4    11
chr1    5018483    5019041    MACS_peak_9    6
";

foreach (split (/\n/, $input)){
    next unless $_;
    my ($chr, $start, $end, $id, $times) = split (/\s+/);
    my $i = 1;
    for ($i=1; $i<$times; $i++){
        print $chr."\t".($start+($i-1)*101)."\t".($start+$i*101)."\t$i\t100\n";
    }
    print $chr."\t".($start+($i-1)*101)."\t$end\t$i\t".($end-$start-($i-1)*101)."\n";
}

The output should be exactly the same as in your question.

And here's a Perl code to make the intervals even:

my $input = "
chr1    37850      38536      MACS_peak_1    7
chr1    820769     821857     MACS_peak_4    11
chr1    5018483    5019041    MACS_peak_9    6
";

foreach (split (/\n/, $input)){
    next unless $_;
    my ($chr, $start, $end, $id, $times) = split (/\s+/);
    my $range = int(($end-$start)/$times);
    foreach (1..$times-1){
        print $chr."\t".($start+$range*($_-1))."\t".($start+$range*$_-1)."\t$_\t".($range-1)."\n";
    }
    print $chr."\t".($start+$range*($times-1))."\t$end\t$times\t".($end-$start-$range*($times-1)-1)."\n";
}
ADD COMMENTlink modified 16 months ago • written 16 months ago by qiyunzhu360
4
gravatar for Aaronquinlan
16 months ago by
Aaronquinlan7.3k
Aaronquinlan7.3k wrote:

bedtools has a utility called makewindows that will do exactly this. It seems that the 5th column in your file is superfluous, as you are really just making 100bp windows from the original intervals. One note in your example is that your start coordinates are one higher than they should be as BED start coordinates are 0-based, whereas the end coordinates are 1-based. I hope this helps.

bedtools makewindows -b data.bed -w 100 -i winnum | awk -v OFS="\t" '{print $0,$3-$2}'
chr1    37850    37950    1    100
chr1    37950    38050    2    100
chr1    38050    38150    3    100
chr1    38150    38250    4    100
chr1    38250    38350    5    100
chr1    38350    38450    6    100
chr1    38450    38536    7    86
chr1    820769    820869    1    100
chr1    820869    820969    2    100
chr1    820969    821069    3    100
chr1    821069    821169    4    100
chr1    821169    821269    5    100
chr1    821269    821369    6    100
chr1    821369    821469    7    100
chr1    821469    821569    8    100
chr1    821569    821669    9    100
chr1    821669    821769    10    100
chr1    821769    821857    11    88
chr1    5018483    5018583    1    100
chr1    5018583    5018683    2    100
chr1    5018683    5018783    3    100
chr1    5018783    5018883    4    100
chr1    5018883    5018983    5    100
chr1    5018983    5019041    6    58

if you really meant to skip one base between each interval, you'll need to change the awk statement. But I think this is what you want. As a control, load your file into UCSC and see what the windows look like.

ADD COMMENTlink modified 16 months ago • written 16 months ago by Aaronquinlan7.3k

The only problem is that the next line is not offset from the last by one. You may want to re-check the expected output

ADD REPLYlink written 16 months ago by steve40

My hunch is that the OP didn't intend to create BED windows that skip a base between each window. If intentional, a very minor change to the awk statement will solve this. Updated to demonstrate.

ADD REPLYlink modified 16 months ago • written 16 months ago by Aaronquinlan7.3k

It's sounds like you're probably correct - you see, I don't have any experience with BED files or bedtools. However, I don't think the minor change that you've made to your awk command would work as intended. Did you test the code?

ADD REPLYlink written 16 months ago by steve40

Right you are. Removed the edit under the assumption that this is what they want.

ADD REPLYlink written 16 months ago by Aaronquinlan7.3k
3
gravatar for steve
16 months ago by
steve40
steve40 wrote:

I have followed this from here:

http://stackoverflow.com/questions/13551174/how-to-make-intervals-in-a-range-of-numbers

Here's one way using awk:

awk -v num=100 '{ for (i=1;i<=$NF;i++) { print $1, $2, var=(i==$NF?$3:$2+=num), i, (var-$2==0?"100":var-$2); $2++ } }' file

Results:

chr1 37850 37950 1 100
chr1 37951 38051 2 100
chr1 38052 38152 3 100
chr1 38153 38253 4 100
chr1 38254 38354 5 100
chr1 38355 38455 6 100
chr1 38456 38536 7 80
chr1 820769 820869 1 100
chr1 820870 820970 2 100
chr1 820971 821071 3 100
chr1 821072 821172 4 100
chr1 821173 821273 5 100
chr1 821274 821374 6 100
chr1 821375 821475 7 100
chr1 821476 821576 8 100
chr1 821577 821677 9 100
chr1 821678 821778 10 100
chr1 821779 821857 11 78
chr1 5018483 5018583 1 100
chr1 5018584 5018684 2 100
chr1 5018685 5018785 3 100
chr1 5018786 5018886 4 100
chr1 5018887 5018987 5 100
chr1 5018988 5019041 6 53
ADD COMMENTlink modified 16 months ago • written 16 months ago by steve40

+1 Awk is really powerful :)

ADD REPLYlink written 16 months ago by Sukhdeep Singh4.6k
0
gravatar for daniel.soronellas
16 months ago by
daniel.soronellas220 wrote:

Thanks everyone for your help! it worked!!!! :)

ADD COMMENTlink written 16 months ago by daniel.soronellas220

This isn't an answer, so you should put it as a comment in your original question and delete this "answer". Glad you found a solution to your problem!

ADD REPLYlink written 16 months ago by Josh Herr4.1k
Please log in to add an answer.

Help
Access
  • RSS
  • Stats
  • API

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.0.0
Traffic: 375 users visited in the last hour