Question: separating a BED file into separate files for every 1,000,000 base pairs
chrisclarkson100 wrote (17 months ago, European Union):

I have a really large BED file (24 GB) and I want to separate it into different files for every million base pairs. I have done it the pure-programming way, but this is really slow:

Python

import sys

# read the whole BED file into memory (slow for a 24 GB file)
with open(sys.argv[1]) as f:
    lines = f.readlines()

start, end, chrom = [], [], []
for line in lines:
    fields = line.rstrip().split()
    chrom.append(fields[0])
    start.append(int(fields[1]))
    end.append(int(fields[2]))

counter = 1
i_s = 0
while i_s < len(start):
    i_e = i_s
    while i_e < len(end):
        # once the span exceeds 1 Mb, flush the accumulated lines
        if end[i_e] - start[i_s] > 1000000:
            with open(str(counter) + 'OI.txt', 'w') as regions:
                for i in range(i_s, i_e):
                    regions.write(chrom[i] + '\t' + str(start[i]) + '\t' + str(end[i]) + '\n')
            counter += 1
            i_s = i_e
            break
        i_e += 1
    i_s += 1

R

args <- commandArgs(TRUE)
library(readr)

# read the whole BED file into memory (slow for a 24 GB file)
df <- read_delim(args[1], delim = '\t', col_names = FALSE)

i_s <- 1
counter <- 1
while (i_s < nrow(df)) {
  i_e <- i_s
  while (i_e < nrow(df)) {
    # once the span exceeds 1 Mb, flush the accumulated rows
    if (df[[3]][i_e] - df[[2]][i_s] > 1000000) {
      write.table(df[i_s:(i_e - 1), ], paste0(counter, 'OI.txt'),
                  sep = '\t', col.names = FALSE, row.names = FALSE, quote = FALSE)
      counter <- counter + 1
      i_s <- i_e
      break
    }
    i_e <- i_e + 1
  }
  i_s <- i_s + 1
}

Both of these approaches are extremely slow... does anyone know of a faster way to achieve this?


In the Python script, you are opening the output file many times until the while loop ends. Put the file-open statement above the second while loop; that will improve the speed. Also, the input file is 24 GB, so it will take some time regardless.

To increase the speed further, try to reduce the nested while loops or use a parallel approach.
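
Along those lines, a minimal single-pass sketch that streams the input line by line and opens each output file exactly once (the 1 Mb window and the 'OI.txt' naming mirror the question; the window boundaries are handled per line rather than per flush, so treat this as an illustration, not a drop-in replacement):

import sys

WINDOW = 1_000_000  # window size from the question

out = None
counter = 0
window_chrom = None
window_start = None

with open(sys.argv[1]) as bed:
    for line in bed:
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        # start a new output file when the window is exceeded
        # or the chromosome changes
        if out is None or chrom != window_chrom or end - window_start > WINDOW:
            if out:
                out.close()
            counter += 1
            window_chrom, window_start = chrom, start
            out = open(str(counter) + "OI.txt", "w")
        out.write(chrom + "\t" + str(start) + "\t" + str(end) + "\n")

if out:
    out.close()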

— Renesh, 17 months ago
Kevin Blighe wrote (17 months ago, Republic of Ireland):

This is my own example, which may or may not be of use to you:

awk 'BEGIN {
  chr = "chr1"; count = 1;
  file = chr "Block" count ".bed";
}
{
  sum += ($3 - $2) + 1;
  # keep appending to the current block file while the running total is
  # under 1 Mb, the chromosome is unchanged, and there is no >1 Mb jump
  if (sum < 1000000 && $1 == chr && (($3 - previouspos) + 1) < 1000000) {
    print >> file;
  }
  # otherwise start a new block file and reset the running total
  if (sum >= 1000000 || $1 != chr || (($3 - previouspos) + 1) >= 1000000) {
    chr = $1; count += 1;
    file = chr "Block" count ".bed";
    print >> file;
    sum = ($3 - $2) + 1;
  }
  previouspos = $3;
}
{ if ($1 != chr) count = 1 }' MyBed.bed

It breaks the BED file up into blocks of 1,000,000 bp, splitting also by chromosome, and outputs these blocks to separate files.

The only assumption you have to ensure holds is that the first entry is on 'chr1'.

— Kevin Blighe, 17 months ago

I see that my answer was accepted - thanks! However, there are critical assumptions that I make with my one-line awk command above.

Here is a much more robust version that does the following:

  • Sorts the BED file
  • Splits individual regions greater than 1 megabase into multiple regions
  • Breaks up the BED file into multiple BED files, each containing regions totalling 1 megabase or less in length


sort -k1,1 -k2,2n MyBed.bed |
awk '{
  # first pass: split any single region longer than 1 Mb into 1 Mb pieces
  if ((($3 - $2) + 1) > 1000000) {
    numintervals = int(((($3 - $2) + 1) / 1000000) + 1);
    for (i = 1; i <= numintervals; ++i)
      if (i == 1)
        print $1 "\t" $2 "\t" ($2 + 999999);
      else if (i == numintervals)
        print $1 "\t" $2 + ((999999 + 1) * (i - 1)) "\t" $3;
      else
        print $1 "\t" $2 + ((999999 + 1) * (i - 1)) "\t" ($2 + ((999999 + 1) * (i - 1))) + 999999;
  } else {
    print;
  }
}' |
awk 'BEGIN {
  chr = "chr1"; count = 1;
  file = chr "Block" count ".bed";
}
{
  # second pass: the same block-splitting logic as the one-liner above
  sum += ($3 - $2) + 1;
  if (sum < 1000000 && $1 == chr && (($3 - previouspos) + 1) < 1000000) {
    print >> file;
  }
  if (sum >= 1000000 || $1 != chr || (($3 - previouspos) + 1) >= 1000000) {
    chr = $1; count += 1;
    file = chr "Block" count ".bed";
    print >> file;
    sum = ($3 - $2) + 1;
  }
  previouspos = $3;
}
{ if ($1 != chr) count = 1 }'
— Kevin Blighe, 17 months ago
Alex Reynolds wrote (17 months ago, Seattle, WA USA):

Use BEDOPS bedextract to quickly split the BED file into separate chromosomes:

$ for chr in `bedextract --list-chr monolithic.bed`; do bedextract ${chr} monolithic.bed > ${chr}.bed; done

Once you have per-chromosome files, you can parallelize the work very easily.
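
For instance, a minimal sketch of that parallelisation with Python's multiprocessing, operating on the per-chromosome files produced by the bedextract loop above (the binning command it runs is a placeholder; the script name and arguments are hypothetical):

from multiprocessing import Pool
import glob
import subprocess

def bin_chromosome(bed_path):
    # Placeholder: run your per-chromosome binning script here.
    # The script name and arguments below are hypothetical.
    subprocess.run(["python", "first_fit.py", bed_path, "100", "1e8"], check=True)

if __name__ == "__main__":
    # One worker per CPU core, each handling one per-chromosome BED file
    # (chr*.bed, as written by the bedextract loop above).
    with Pool() as pool:
        pool.map(bin_chromosome, sorted(glob.glob("chr*.bed")))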

Here is a Python script that applies a first-fit algorithm for binning elements up to some maximum size (1e8 bases in the example in the Gist, but you can specify this).
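
The Gist itself is not reproduced here; purely as an illustration of the idea, a generic first-fit sketch (the two command-line arguments, the '1.bed', '2.bed', ... output naming, and the function name are assumptions, not the Gist's actual interface):

import sys

def first_fit(elements, max_size):
    # Assign each (chrom, start, end) element to the first bin whose
    # running base-pair total stays within max_size.
    bins, totals = [], []
    for elem in elements:
        size = elem[2] - elem[1]
        for i in range(len(totals)):
            if totals[i] + size <= max_size:
                bins[i].append(elem)
                totals[i] += size
                break
        else:
            # no existing bin has room: open a new one
            bins.append([elem])
            totals.append(size)
    return bins

if __name__ == "__main__":
    # usage sketch: python first_fit_sketch.py chrN.bed 1e8
    elements = []
    with open(sys.argv[1]) as bed:
        for line in bed:
            chrom, start, end = line.split()[:3]
            elements.append((chrom, int(start), int(end)))
    for n, elems in enumerate(first_fit(elements, int(float(sys.argv[2]))), 1):
        with open(str(n) + ".bed", "w") as out:  # one output file per bin
            for chrom, start, end in elems:
                out.write(chrom + "\t" + str(start) + "\t" + str(end) + "\n")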

Ideally you would put the work of binning each chromosome onto separate compute nodes.


Hi, thank you for this. In your GitHub example you input:

first_fit.py hg19.gapped_extents.bed 100 1e8

I have tried your example on my own data and it looks encouraging, but I am looking to write the lines within a certain base-pair range to separate files, and this script does not seem to do that. The 'bin' objects seem to be dictionaries with the chromosome as a key and the range as a value... could you show me where in the code I could modify it so that the elements in each bin are written to a separate file? I.e. if bin 1 has 20 elements, could each of these be written to a '1.bed', and the same for bins 2, 3, ...?

What difference does the input of 100 make (if I were to increase the bin size to 1000, would it make it faster)?

— chrisclarkson100, 17 months ago