Separating a BED file into separate files for every 100000 base pairs

I have a really large BED file (24 GB) and I want to split it into separate files for every million base pairs. I have done it the pure-programming way, but it is really slow.

Python:

import sys

# Read the whole BED file into memory (slow for a 24 GB input)
with open(sys.argv[1]) as bed:
    lines = bed.readlines()

chrom = []
start = []
end = []
for line in lines:
    fields = line.rstrip().split()
    chrom.append(fields[0])
    start.append(int(fields[1]))
    end.append(int(fields[2]))

counter = 1
i_s = 0
while i_s < len(start):
    i_e = i_s
    while i_e < len(end):
        # Flush the current block once it spans more than 1 Mb
        if end[i_e] - start[i_s] > 1000000:
            with open(str(counter) + 'OI.txt', 'w') as regions:
                for i in range(i_s, i_e):
                    regions.write(chrom[i] + '\t' + str(start[i]) + '\t' + str(end[i]) + '\n')
            counter += 1
            i_s = i_e
            break
        i_e += 1
    i_s += 1

R:

library(readr)

args <- commandArgs(TRUE)
df <- read_delim(args[1], delim = '\t', col_names = FALSE)

i_s <- 1
counter <- 1
while (i_s < nrow(df)) {
  i_e <- i_s
  while (i_e < nrow(df)) {
    # Flush the current block once it spans more than 1 Mb
    if (df[[3]][i_e] - df[[2]][i_s] > 1000000) {
      write.table(df[i_s:(i_e - 1), ], file = paste0(counter, 'OI.txt'),
                  sep = '\t', col.names = FALSE, row.names = FALSE, quote = FALSE)
      counter <- counter + 1
      i_s <- i_e
      break
    }
    i_e <- i_e + 1
  }
  i_s <- i_s + 1
}

Both of these approaches are extremely slow. Does anyone know of a faster way to achieve this?

Comment:

In the Python script, you are opening the output file over and over until the while loop ends. Move the open statement above the second while loop; that will increase the speed. Also, the input file is 24 GB, so it will take some time regardless.

To increase speed further, try to reduce the nested while loops or use a parallel approach; a single-pass version is sketched below.
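
A minimal sketch of that single-pass idea, keeping the original script's numbered 'OI.txt' filenames and 1 Mb threshold: stream the file line by line so the 24 GB input is never held in memory, and keep exactly one output handle open at a time. Like the original, it ignores chromosome boundaries.

import sys

# Stream the BED file instead of loading it all with readlines(),
# and keep a single output file handle open per block.
THRESHOLD = 1000000

counter = 1
block_start = None
out = open(str(counter) + 'OI.txt', 'w')

with open(sys.argv[1]) as bed:
    for line in bed:
        if not line.strip():
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        if block_start is None:
            block_start = start
        # Open the next numbered file once the block spans more than 1 Mb
        if end - block_start > THRESHOLD:
            out.close()
            counter += 1
            out = open(str(counter) + 'OI.txt', 'w')
            block_start = start
        out.write(chrom + '\t' + str(start) + '\t' + str(end) + '\n')

out.close()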

Answer (accepted):

This is my own example, which may or may not be of use to you:

awk 'BEGIN {chr="chr1"; count=1; file=chr"Block"count".bed"; previouspos=$3} {sum+=(($3-$2)+1); if (sum<1000000 && $1==chr && (($3-previouspos)+1)<1000000) {print >> file}; if (sum>=1000000 || $1!=chr || (($3-previouspos)+1)>=1000000) {chr=$1; count+=1; file=chr"Block"count".bed"; print >> file; sum=(($3-$2)+1)} previouspos=$3} {if ($1!=chr) count=1}' MyBed.bed

It breaks the input up into blocks of 1,000,000 bp, also splitting by chromosome, and writes each block to a separate file.

The only assumption you have to ensure holds is that the first entry begins with 'chr1'.

Comment:

I see that my answer was accepted - thanks! However, there are critical assumptions that I make with my one-line awk command above.

Here is a much more robust version that does the following:

  • Sorts the BED file
  • Splits individual regions greater than 1 megabase into multiple regions
  • Breaks up the BED file into multiple BED files, each containing regions totalling 1 megabase or less in length


sort -k 1,1 -k2,2n MyBed.bed |
awk '{if ((($3-$2)+1)>1000000) {numintervals=int((((($3-$2)+1)/1000000)+1)); for (i=1; i<=numintervals; ++i) if (i==1) {print $1"\t"$2"\t"($2+999999)} else if (i==numintervals) {print $1"\t"$2+((999999+1)*(i-1))"\t"$3} else {print $1"\t"$2+((999999+1)*(i-1))"\t"($2+((999999+1)*(i-1)))+999999}} else {print}}' |
awk 'BEGIN {chr="chr1"; count=1; file=chr"Block"count".bed"; previouspos=$3} {sum+=(($3-$2)+1); if (sum<1000000 && $1==chr && (($3-previouspos)+1)<1000000) {print >> file}; if (sum>=1000000 || $1!=chr || (($3-previouspos)+1)>=1000000) {chr=$1; count+=1; file=chr"Block"count".bed"; print >> file; sum=(($3-$2)+1)} previouspos=$3} {if ($1!=chr) count=1}'

Answer:

Use BEDOPS bedextract to quickly split the BED file into separate chromosomes:

$ for chr in `bedextract --list-chr monolithic.bed`; do bedextract ${chr} monolithic.bed > ${chr}.bed; done

Once you have per-chromosome files, you can parallelize the work very easily.

Here is a Python script that applies a first-fit algorithm for binning elements up to some maximum size (1e8 bases in the example in the Gist, but you can specify this).
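
The Gist itself is not reproduced here, but the core first-fit idea is roughly the following sketch: each element goes into the first bin whose running total still has room, and a new bin is opened when none does. The function name and the (chrom, start, end) tuple layout are illustrative, not the Gist's actual code.

def first_fit(elements, max_bases):
    # elements: iterable of (chrom, start, end) tuples.
    # Returns a list of bins; each bin's total length is <= max_bases.
    bins = []    # each bin is a list of elements
    totals = []  # running base-pair total per bin
    for chrom, start, end in elements:
        length = end - start
        for i, total in enumerate(totals):
            # First fit: take the first bin with enough room left
            if total + length <= max_bases:
                bins[i].append((chrom, start, end))
                totals[i] += length
                break
        else:
            # No existing bin has room, so open a new one
            bins.append([(chrom, start, end)])
            totals.append(length)
    return bins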

Ideally you would put the work of binning each chromosome onto separate compute nodes.
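
Short of a cluster, the same split can be driven on a single machine with a process pool over the per-chromosome files. A sketch, where bin_chromosome is a hypothetical placeholder for whatever per-chromosome binning you run:

import glob
from multiprocessing import Pool

def bin_chromosome(path):
    # Hypothetical placeholder: apply first-fit binning (as sketched
    # above) to one per-chromosome BED file produced by bedextract.
    print('binning', path)

if __name__ == '__main__':
    with Pool() as pool:
        pool.map(bin_chromosome, sorted(glob.glob('chr*.bed')))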

Comment:

Hi, thank you for this. In your GitHub example, the input is:

first_fit.py hg19.gapped_extents.bed 100 1e8

I have tried your example on my own data and it looks encouraging, but I am looking to write the lines within a certain base-pair range to separate files, and this script does not seem to do that. The 'bin' objects seem to be dictionaries with the chromosome as a key and a range as a value. Could you show me where in the code I could modify it so that the elements in each bin are written to a separate file, i.e. if bin 1 has 20 elements, could these be written to '1.bed', and the same for bins 2, 3, ...?
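
To illustrate, if each bin could be flattened to a list of (chrom, start, end) tuples (a guess at the structure, not the Gist's actual internals), what I am after is roughly:

# A guess at the desired output: flatten each bin to (chrom, start, end)
# tuples and write bin 1 to '1.bed', bin 2 to '2.bed', and so on.
for number, elements in enumerate(bins, start=1):
    with open(str(number) + '.bed', 'w') as out:
        for chrom, start, end in elements:
            out.write('{}\t{}\t{}\n'.format(chrom, start, end))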

Also, what difference does the input of 100 make? If I were to increase the bin size to 1000, would it be faster?
