Question: split a bed file into several bed files in which every region is separated from all others by at least N bases
1
14134125465346445 (3.4k), United Kingdom, 2.8 years ago, wrote:

I have a bed file with regions of interest, and I would like to split it into the minimum number of bed files such that, within each file, all regions are separated from each other by N bases or more. Any ideas what tool I could use?

E.g. input.bed

chr1 1000 1050
chr1 1080 1130
chr1 2000 2050

Would be split by:

split_by_distance -n 150 -i input.bed

And would produce:

input0001.bed

chr1 1000 1050

input0002.bed

chr1 1080 1130
chr1 2000 2050

Explained graphically:

Original file has 3 entries

Minimum distance:

[xxxxxxxxx]

First and second are too close:

       [xxxxxxxxx]
##########
               ##########
                                        ##########

Output is:

file1: first
file2: second and third

Another example:

Minimum distance: [xxxxxx]

        [xxxxxx]     [xxxxxx]      [xxxxxx]        [xxxxxx]
AAAAAAAAAA   BBBBBBBBBB   CCCCCCCCCC     DDDDDDDDD          EEEEEEEEEE

Output files:

File1:

AAAAAAAAAA                CCCCCCCCCC                        EEEEEEEEEE

File2:

             BBBBBBBBBB                  DDDDDDDDD

Thx
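The split described in the question can be sketched as a greedy first-fit assignment: sort the intervals, then put each one into the first output group whose last interval on the same chromosome ends at least N bases before it starts. This is only a minimal sketch, not any of the tools mentioned in this thread. The function name `split_by_distance` is borrowed from the hypothetical command in the question, and the gap is assumed to be `start - previous_end` (half-open BED coordinates). On the first example it puts the third region in the first group rather than the second, which still uses two groups and still satisfies the distance rule.

```python
def split_by_distance(intervals, min_gap):
    """Greedy first-fit split of (chrom, start, end) tuples.

    Within each returned group, intervals on the same chromosome are
    at least min_gap bases apart (gap assumed to be start - previous end).
    """
    groups = []    # groups[i]: intervals assigned to output file i
    last_end = []  # last_end[i][chrom]: end of the last interval placed in group i
    for chrom, start, end in sorted(intervals):
        for i, ends in enumerate(last_end):
            prev = ends.get(chrom)
            if prev is None or start - prev >= min_gap:
                break  # group i can take this interval
        else:
            # No existing group fits: open a new one.
            groups.append([])
            last_end.append({})
            i = len(groups) - 1
        groups[i].append((chrom, start, end))
        last_end[i][chrom] = end
    return groups


example = [("chr1", 1000, 1050), ("chr1", 1080, 1130), ("chr1", 2000, 2050)]
for n, group in enumerate(split_by_distance(example, 150), start=1):
    print(f"input{n:04d}.bed", group)
```

Because the intervals are processed in sorted order, a new group is opened only when the current interval is too close to the last interval of every existing group, so no split with fewer groups is possible.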

bedops bed bedtools • 1.8k views
modified 3 months ago by RamRS19k • written 2.8 years ago by 141341254653464453.4k

I don't think this completely solves the problem. But bedtools makewindows might be something to look into. It does not output the results into multiple bed files unfortunately, but it's a start.

modified 3 months ago by RamRS19k • written 2.8 years ago by cbio400

So row1 is in its own file because row2 is less than 150bp away, but row3 is with row2 because it's further than 150bp away?

wat.

modified 3 months ago by RamRS19k • written 2.8 years ago by John12k

I think he might have added the 1 in -n 150 by accident. We shouldn't assume, but it seems that he wants to split his input bed file into multiple bed files based on the -n number, so that every n bases the input would be split into a different file. I'm not sure WHY exactly.

modified 3 months ago by RamRS19k • written 2.8 years ago by cbio400

I could understand that, but I think he actually wants to split the file so that each subfile represents a "lone interval" or a cluster of intervals. Perhaps from a peak caller. Also, not exactly sure why :P

modified 3 months ago by RamRS19k • written 2.8 years ago by John12k
3
Pierre Lindenbaum (115k), France/Nantes/Institut du Thorax - INSERM UMR1087, 2.8 years ago, wrote:

I quickly wrote something

java -jar dist/biostar178713.jar -d 100000 -o out.zip in1.bed in2.bed 

[main] INFO jvarkit - sorting 1524
[main] INFO jvarkit - creating zip jeter.zip
[main] INFO jvarkit - creating bed001.bed
[main] INFO jvarkit - closing bed001.bed N=186
[main] INFO jvarkit - creating bed002.bed
[main] INFO jvarkit - closing bed002.bed N=143
[main] INFO jvarkit - creating bed003.bed
[main] INFO jvarkit - closing bed003.bed N=122

 

$ unzip -t out.zip

Archive:  jeter.zip
    testing: bed001.bed               OK
    testing: bed002.bed               OK
    testing: bed003.bed               OK
No errors detected in compressed data of jeter.zip.
modified 3 months ago by RamRS19k • written 2.8 years ago by Pierre Lindenbaum115k
0
dariober (9.8k), Glasgow - UK, 2.8 years ago, wrote:

I think you can use the bedtools spacing command:

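# "." in column 4 marks the first interval on each chromosome (no previous interval)
bedtools spacing -i in.bed | awk '$4 == "." || $4 >= 150' > out1.bed
bedtools spacing -i in.bed | awk '$4 != "." && $4 < 150' > out2.bed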

This assumes your input bed has only 3 columns, so that the gap is reported in column 4.

modified 3 months ago by RamRS19k • written 2.8 years ago by dariober9.8k

I could not find the spacing command in bedtools or bedtools2... Any ideas?

written 2.8 years ago by 141341254653464453.4k

What version of bedtools do you have? The spacing command has been included starting from release 2.23. Here's the help:

bedtools spacing

Tool:    bedtools spacing
Version: v2.25.0
Summary: Report (last col.) the gap lengths between intervals in a file.

Usage:   bedtools spacing [OPTIONS] -i <bed/gff/vcf/bam>

Notes: 
    (1)  Input must be sorted by chrom,start (sort -k1,1 -k2,2n for BED).
    (2)  The 1st element for each chrom will have NULL distance. (".").
    (3)  Distance for overlapping intervaks is -1 and bookended is 0.

Example: 
    $ cat test.bed 
    chr1    0   10 
    chr1    10  20 
    chr1    21  30 
    chr1    35  45 
    chr1    100 200 

    $ bedtools spacing -i test.bed 
    chr1    0   10  . 
    chr1    10  20  0 
    chr1    21  30  1 
    chr1    35  45  5 
    chr1    100 200 55 

    -bed    If using BAM input, write output as BED.

    -header Print the header from the A file prior to results.

    -nobuf  Disable buffered output. Using this option will cause each line
        of output to be printed as it is generated, rather than saved
        in a buffer. This will make printing large output files 
        noticeably slower, but can be useful in conjunction with
        other software tools and scripts that need to process one
        line of bedtools output at a time.

    -iobuf  Specify amount of memory to use for input buffer.
        Takes an integer argument. Optional suffixes K/M/G supported.
        Note: currently has no effect with compressed files.
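To make notes (1) to (3) in the help text concrete, here is a small sketch that reproduces the last column of the example above. It is not the bedtools implementation: the function name `spacing` is hypothetical, the input is assumed to be already sorted by chrom and start, and each interval is compared only with the record immediately before it.

```python
def spacing(intervals):
    """Yield (chrom, start, end, gap) for intervals sorted by chrom, start.

    gap is "." for the first interval of a chromosome, -1 for an overlap
    with the previous interval, 0 for bookended intervals, and otherwise
    the number of bases between the previous end and this start.
    """
    prev_chrom, prev_end = None, None
    for chrom, start, end in intervals:
        if chrom != prev_chrom:
            gap = "."            # note (2): no previous interval on this chrom
        elif start < prev_end:
            gap = -1             # note (3): overlapping intervals
        else:
            gap = start - prev_end
        yield chrom, start, end, gap
        prev_chrom, prev_end = chrom, end


test_bed = [("chr1", 0, 10), ("chr1", 10, 20), ("chr1", 21, 30),
            ("chr1", 35, 45), ("chr1", 100, 200)]
for row in spacing(test_bed):
    print(*row, sep="\t")
```

The gap only describes the distance to the previous interval, so filtering on it separates "close to the left neighbour" from "far from the left neighbour"; it does not by itself guarantee that every pair of intervals in an output file is N bases apart.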
modified 2.8 years ago • written 2.8 years ago by dariober9.8k