Question

how to split "blocks" in one entry of bed file into different entries

0

Entering edit mode

7.5 years ago

scottzijiezhang • 0

Hi all, I have a peak calling results stored in bed file. Some of the peaks are stored in one entry as different blocks. I want to split those blocks each into a separate entry using the information provided in blockCount, blockSizes, blockStarts.

For example, Input:

chr   chromStart      chromEnd        name    score   strand  thickStart      thickEnd        itemRgb blockCount      blockSizes      blockStarts
chr1    943253  943774  SAMD11  2e-59   +       943253  943774  0       2       124,77, 0,444
chr1    944351  944581  SAMD11  5e-35   +       944351  944581  0       1       230,    0

output:

chr1    943253    943377    SAMD11    2e-59    +    943253    943377    0    1    124,    0
chr1    943697    943774    SAMD11    2e-59    +    943697    943774    0    1    77,    0
chr1    944351  944581  SAMD11  5e-35   +       944351  944581  0       1       230,    0

Note, in the example, the first entry contains two peaks (blocks) and the second contains only one. In my data, some of my entries contain two or multiple (more than two) peaks (blocks) while others contain only one.

Can this be done by some tools or codes? Any help is appreciated!

Thanks

Scott

bed format peak • 2.7k views

ADD COMMENT • link updated 7.5 years ago by Alex Reynolds 35k • written 7.5 years ago by scottzijiezhang • 0

0

Entering edit mode

This looks like genepred format, I could find something to convert it to gtf which may be a step in the right direction.

ADD REPLY • link 7.5 years ago by WouterDeCoster 47k

0

Entering edit mode

It's bed12 format, for what it's worth.

ADD REPLY • link 7.5 years ago by Devon Ryan 104k

score 2 · Answer 1 · 2016-11-06

2

Entering edit mode

7.5 years ago

Alex Reynolds 35k

This should get you about 98% of the way there:

$ awk 'BEGIN{ OFS="\t" } NR>1{ blockCount=$10; blockSizesStr=$11; blockStartsStr=$12; blockSizesCount=split(blockSizesStr,blockSizes,","); blockStartsCount=split(blockStartsStr,blockStarts,","); for(i=1;i<=blockCount;i++) { $2+=blockStarts[i]; $3=$2+blockSizes[i]; print $0; } }' intervals.bed12
chr1    943253  943377  SAMD11  2e-59   +   943253  943774  0   2   124,77, 0,444
chr1    943697  943774  SAMD11  2e-59   +   943253  943774  0   2   124,77, 0,444
chr1    944351  944581  SAMD11  5e-35   +   944351  944581  0   1   230,    0

You can fiddle with the print statements, as needed.

ADD COMMENT • link 7.5 years ago by Alex Reynolds 35k

0

Entering edit mode

Thank you so much Alex! your code helps a lot! With just a minor change as below:

awk 'BEGIN{ OFS="\t" } NR>1{ originalStart=$2; newStart=$2; blockCount=$10; blockSizesStr=$11; blockStartsStr=$12; blockSizesCount=split(blockSizesStr,blockSizes,","); blockStartsCount=split(blockStartsStr,blockStarts,","); for(i=1;i<=blockCount;i++) { $2=$7+blockStarts[i]; $3=$2+blockSizes[i]; $10=1; $11=blockSizes[i]; $12=0; print $0; } }' con_peak.bed > tmp.bed

I was able to get a bed file with every entry as individual peaks.

ADD REPLY • link 7.5 years ago by scottzijiezhang • 0